A Beginner’s Guide to Databricks
Databricks makes it easy for data scientists to create and manage notebooks for research, experimentation, and deployment. The appeal of platforms like Databricks includes seamless integration with cloud services, tooling for model maintenance, and scalability.
Databricks is very useful for model experimentation and maintenance. It comes with MLflow, a library for managing the machine learning lifecycle that provides useful tooling for model development and deployment. With MLflow, you can log models along with related metadata such as performance metrics and hyperparameters. This makes it straightforward to run experiments and analyze results.
Many Databricks features are useful for scaling steps throughout the machine learning workflow, such as data loading, model training, and model logging. Koalas is a library in Databricks that serves as a more scalable alternative to Pandas. Pandas user-defined functions (UDFs) allow you to apply custom functions, which are often computationally costly, in a distributed manner, which can significantly reduce runtime. Databricks also lets you configure jobs on larger machines, which is useful for dealing with large data and heavy computation. Further, the model registry lets you run and store experiment results for hundreds or even thousands of models. This is useful for scaling the number of models that a researcher develops and eventually deploys.
In this article, we'll cover some of the basics of Databricks. First, we'll walk through a simple data science workflow where we'll build a churn classification model. We'll then see how we can use tools like Koalas and Pandas UDFs to speed up specific operations. Finally, we'll see how we can use MLflow to help us run experiments and inspect results.
Here, we will be working with the Telco churn data set. This data contains customer billing information for a fictional telco company. It specifies whether a customer stopped or continued using the service, known as churning. The data is publicly available and is free to use, share, and modify under the Apache 2.0 license.
Getting Started
To start, navigate to the Databricks website and click on "Get Started for Free":
You should see the following:
Enter your information and click continue. Next, you will be prompted to select a cloud platform. We won't be working with any external cloud platforms in this article. At the bottom of the right-hand panel, click on "Get Started with Community Edition".
Next follow the steps to create a Community Edition Account.
Importing Data
Let's start by navigating to the 'Data' tab in the left-hand panel:
Next click on 'Data' and then click on 'Create Table':
Next, drag and drop the churn CSV file in the space where it says "Drop files to upload, or click to browse".
Upon uploading the CSV you should see the following:
Next click on "Create Table in Notebook". An example notebook with logic for writing this file to the Databricks File System (DBFS) will pop up:
DBFS allows Databricks users to upload and manage data. The system is distributed, so it is very useful for storing and managing large amounts of data.
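For example, once a file has been uploaded you can list the contents of the FileStore directly from a notebook using the built-in dbutils utilities. A minimal sketch, assuming the default /FileStore/tables/ upload location (yours may differ):

# List files uploaded to the Databricks FileStore
display(dbutils.fs.ls("/FileStore/tables/"))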
The first cell specifies the logic for reading the churn data we uploaded:
# File location and type
file_location = "/FileStore/tables/telco_churn-1.csv"
file_type = "csv"

# CSV options
infer_schema = "false"
first_row_is_header = "false"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
    .option("inferSchema", infer_schema) \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .load(file_location)

display(df)
If we run this cell we get the following result:
We see that the table includes column names that aren't very useful (_c0, _c1, … etc.). To fix this we need to specify first_row_is_header = "true":
first_row_is_header = "true"
When we run this cell again, we now get:
If you click on the table, you can scroll to the right and see the additional columns in the data:
Building a Classification Model
Let's proceed by building a churn classification model in Databricks using our uploaded data. On the left-hand panel click on 'Create':
Next click on notebook:
Let’s name our notebook “churn_model”:
Now we can copy the logic from the DBFS example notebook that allows us to access the data:
Next, let's convert the Spark dataframe into a Pandas dataframe:
df_pandas = df.toPandas()
Let's build a CatBoost classification model. CatBoost is a tree-based ensemble machine learning algorithm that uses gradient boosting to improve the performance of the successive trees used in the ensemble.
Let's pip install the CatBoost package. We do this in a cell at the top of the notebook:
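A cell like the following does the trick (a minimal sketch; you may want to pin a specific package version):

%pip install catboost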
Now let's build a CatBoost churn classification model. We'll use tenure, monthly charges, and contract to predict the churn outcome. First, let's convert the churn column to binary values:
import numpy as np
df_pandas['churn_label'] = np.where(df_pandas['Churn']== 'No', 0, 1)
X = df_pandas[["tenure", "MonthlyCharges", "Contract"]]
y = df_pandas['churn_label']
CatBoost allows us to handle categorical variables directly without the need to convert them to machine-readable codes. To do this we just define a list that contains the names of the categorical columns:
cats = ["Contract"]
When defining the CatBoost model object, we set the cat_features parameter equal to this list. Let's split our data for training and testing:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
And we can train our CatBoost model. We'll just use the default parameter values:
from catboost import CatBoostClassifier

model = CatBoostClassifier(cat_features=cats, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
And we can evaluate performance:
from sklearn.metrics import accuracy_score, precision_score

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)

print("Accuracy: ", accuracy)
print("Precision: ", precision)
Koalas
Here we converted a Spark dataframe to a Pandas dataframe. That is fine for our small data set, but as data size grows Pandas becomes slow and inefficient. An alternative to Pandas is the Koalas library. Koalas is a package developed by Databricks that is a distributed version of Pandas. To use Koalas, we pip install databricks at the top of our notebook:
%pip install -U databricks
And we import Koalas from databricks:
from databricks import koalas as ks
And to convert our Spark dataframe to a Koalas dataframe we do the following:
df_koalas = ks.DataFrame(df)
df_koalas.head()
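From here we can use familiar Pandas-style operations while the underlying computation stays distributed. As a minimal sketch, assuming the Churn column from our data set:

# Pandas-style API, but executed on Spark under the hood
df_koalas['Churn'].value_counts()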
Pandas UDF
Pandas UDF is another great tool in Databricks. It allows you to apply a function to a dataframe in a distributed manner, which is useful for increasing the efficiency of calculations done on large dataframes. For example, we can define a function that takes a data frame and builds a CatBoost model. We can then use a Pandas UDF to apply this function at a grouped or categorical level. Let's build a model for each value of internet service.
To start, we need to define our function and a schema for the Pandas UDF. The schema simply specifies the column names and their data types:
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, FloatType, StringType

churn_schema = StructType(
[
StructField("tenure", FloatType()),
StructField("Contract", StringType()),
StructField("InternetService", StringType()),
StructField("MonthlyCharges", FloatType()),
StructField("Churn", FloatType()),
StructField("Predictions", FloatType()),
]
)
Next we'll define our function. We'll simply wrap the logic we defined earlier in a function called 'build_model'. To use a Pandas UDF we add the decorator '@pandas_udf':
import pandas as pd

@pandas_udf(churn_schema, PandasUDFType.GROUPED_MAP)
def build_model(df: pd.DataFrame) -> pd.DataFrame:
And we can include the model-building logic in our function. We'll also store the predictions and the true churn values in our dataframe:
@pandas_udf(churn_schema, PandasUDFType.GROUPED_MAP)
def build_model(df: pd.DataFrame) -> pd.DataFrame:
    df['churn_label'] = np.where(df['Churn'] == 'No', 0, 1)
    X = df[["tenure", "MonthlyCharges", "Contract"]]
    y = df['churn_label']
    cats = ["Contract"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = CatBoostClassifier(cat_features=cats, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Return the test rows along with their predictions and true churn labels,
    # matching the column names defined in churn_schema
    output = X_test
    output['Predictions'] = y_pred
    output['Churn'] = y_test
    output['InternetService'] = df['InternetService']
    return output
Finally, we can apply this function to our dataframe. Let's convert our Koalas dataframe back to a Spark dataframe:
df_spark = df_koalas.to_spark()
churn_results = (
df_spark.groupBy('InternetService').apply(build_model))
And we can convert the resulting Spark data frame to a Pandas dataframe (we could also convert back to Koalas) and display the first five rows:
churn_results = churn_results.toPandas()
churn_results.head()
Even though we stored predictions here, you can use a Pandas UDF to store any information that you get as a result of a calculation done on a dataframe. An interesting exercise is to include the accuracy score and precision score in the output Spark dataframe for each internet service value; a possible sketch is shown below.
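As a hint for that exercise, one possible approach returns a single row of metrics per group rather than row-level predictions. The metrics_schema and evaluate_model names below are hypothetical (not part of the original code), and the sketch reuses numpy, train_test_split, CatBoostClassifier, accuracy_score, and precision_score from the earlier cells:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# One output row of metrics per internet service value
metrics_schema = StructType(
    [
        StructField("InternetService", StringType()),
        StructField("Accuracy", DoubleType()),
        StructField("Precision", DoubleType()),
    ]
)

@pandas_udf(metrics_schema, PandasUDFType.GROUPED_MAP)
def evaluate_model(df: pd.DataFrame) -> pd.DataFrame:
    df['churn_label'] = np.where(df['Churn'] == 'No', 0, 1)
    X = df[["tenure", "MonthlyCharges", "Contract"]]
    y = df['churn_label']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = CatBoostClassifier(cat_features=["Contract"], random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Summarize performance for this group in a single row
    return pd.DataFrame(
        {
            "InternetService": [df['InternetService'].iloc[0]],
            "Accuracy": [accuracy_score(y_test, y_pred)],
            "Precision": [precision_score(y_test, y_pred)],
        }
    )

metrics = df_spark.groupBy('InternetService').apply(evaluate_model).toPandas()
metrics.head()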
Getting Started with MLflow
Another great tool in Databricks is MLflow. MLflow allows you to easily run, log, and analyze experiments. For this demonstration we'll work with the first model object we defined earlier in our notebook. Let's pip install MLflow at the top of our notebook:
%pip install -U mlflow
and import MLflow:
import mlflow
Let’s proceed by setting an experiment name:
mlflow.set_experiment(
f"/Users/spierre91@gmail.com/churn_model"
)
One thing we can log is the CatBoost feature importance, which will allow us to analyze which features are important for predicting churn:
feature_importance = pd.DataFrame(
{"variable": model.feature_names_, "importance": model.feature_importances_}
)
feature_importance.to_csv("/feature_importance.csv")
We can then log our CatBoost model using the log_model method:
with mlflow.start_run(run_name=f"churn_model"):
    mlflow.sklearn.log_model(model, "Catboost Model")
We get a notification stating "Logged 1 run to an experiment in MLflow":
We can click on the run and see the following:
This is where we can see metrics like model performance and model artifacts such as feature importance. We'll show how to log each of these in MLflow shortly.
We can also click on the experiment:
This is where we see each run associated with the experiment. This is useful for keeping track of experiments such as modifying CatBoost parameters, training data, engineered features, etc.
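You can also inspect these runs programmatically. As a minimal sketch, mlflow.search_runs returns the runs of the active experiment as a Pandas dataframe:

# Each row corresponds to one logged run in the active experiment
runs = mlflow.search_runs()
runs[["run_id", "status", "start_time"]].head()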
Finally, let's log the feature importance as an artifact, the accuracy score and precision score as metrics, and the list of categorical inputs as a parameter:
with mlflow.start_run(run_name=f"churn_model"):
    mlflow.sklearn.log_model(model, "Catboost Model")
    mlflow.log_artifact("/feature_importance.csv")
    mlflow.log_metric("Precision", precision)
    mlflow.log_metric("Accuracy", accuracy)
    mlflow.log_param("Categories", cats)
If we click on the run, we see that we logged the feature importance, accuracy score, precision score, and categorical inputs:
The code in the Databricks notebook has been ported to an ipython file and is available on GitHub.
Conclusion
In this post, we discussed how to get started with Databricks. First, we saw how to upload data to the DBFS. We then created a notebook and showed how to access the uploaded file within the notebook. We then proceeded to discuss tools available in Databricks that help data scientists and researchers scale data science solutions. First, we saw how to convert Spark dataframes to Koalas dataframes, which are a faster alternative to Pandas. We then saw how to apply custom functions to Spark data frames using Pandas UDFs, which is very useful for heavy computational tasks that need to be performed on large dataframes. Finally, we saw how to log metrics, parameters, and artifacts associated with modeling experiments. Familiarity with these tools is important for anyone working in the data science, machine learning, and machine learning engineering spaces.