I have a CSV file that contains a user-id column. The file is imported as a Dask dataframe. For each entry in the id column, I need to run a SQL query that fetches the user name for that user-id, and add the result to the dataframe as a new column. I have a few such columns that need fetching.
I am unsure what the Dask way of running select queries against values in a Dask dataframe is. How would I go about it? I don't want to take the imperative route and solve it with a for-loop.
This isn't a full answer, but I can't comment yet.
Running multiple queries in a loop is quite inefficient. It would be better to run a single query that pulls all of the user-id/username pairs from your database into another dataframe, then use Dask's merge method to join the two dataframes on the user_id column: https://docs.dask.org/en/latest/dataframe-joins.html
I'm not very experienced with Dask (most of my experience is with Pandas), so there may be a bit more to it than this, but something along these lines:
import dask.dataframe as dd
import pandas as pd
# my_db_connection is a connection object created with whatever
# database connector you happen to be using
dask_df = dd.read_csv("your_csv_file.csv")
user_df = pd.read_sql("""
SELECT user_id, username
FROM user_table
""", con=my_db_connection
)
# Assuming both dataframes use "user_id" as the column name,
# if not use right_on and left_on arguments
merged_df = dask_df.merge(user_df, how="left", on="user_id")
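To check the idea end to end without a real database, here is a minimal sketch. It assumes an in-memory SQLite database, and the `add_usernames` helper, the sample rows, and the `amount` column are all made up for illustration. The demo uses a small pandas dataframe as a stand-in for the CSV. A Dask dataframe from `dd.read_csv` exposes the same `merge` call and accepts a pandas dataframe on the right-hand side, so the helper should work on it unchanged, although I have only run it as shown.

```python
import sqlite3

import pandas as pd


def add_usernames(df, conn):
    """Attach a username column using one query instead of one query per row.

    `df` can be a pandas dataframe or (by assumption) a Dask dataframe,
    since both provide a merge method with these arguments.
    """
    user_df = pd.read_sql("SELECT user_id, username FROM user_table", con=conn)
    # A left join keeps every row of df, even if the id is unknown to the database.
    return df.merge(user_df, how="left", on="user_id")


# Hypothetical in-memory database standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_table (user_id INTEGER PRIMARY KEY, username TEXT)")
conn.executemany(
    "INSERT INTO user_table VALUES (?, ?)",
    [(1, "alice"), (2, "bob")],
)
conn.commit()

# Stand-in for the CSV contents; user_id 3 has no match in the database.
csv_df = pd.DataFrame({"user_id": [2, 1, 3], "amount": [10, 20, 30]})

result = add_usernames(csv_df, conn)
print(result)
```

Rows whose user_id is missing from the database end up with a missing username rather than being dropped, which is the point of `how="left"`. With a Dask dataframe the merge is lazy, so you would call `.compute()` on the result to see actual values.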