Running a MySQL query on each value in a column in Dask

user3454396 :

I have a CSV file that contains a user-id column. The file is loaded as a Dask dataframe. For each entry in the id column, I need to run a SQL query that fetches the user name for that user-id, and add the result to the dataframe as a new column. I have a few such columns that need fetching.

I am unsure what the Dask way of running select queries against the values in a Dask dataframe is. How would I go about it? I don't want to go the imperative route and solve it with a for-loop.

Mark J :

This isn't a full answer, but I can't comment yet.

Running multiple queries in a loop is quite inefficient. It would be better to run a single query that pulls all of the user-id/username pairs from your database into another dataframe, then use Dask's merge method to join the two dataframes on the user_id column. See https://docs.dask.org/en/latest/dataframe-joins.html

I'm not very experienced with Dask (most of my experience is with Pandas), so there may be a bit more to it than this, but something along these lines:

import dask.dataframe as dd
import pandas as pd

# my_db_connection is a connection (or SQLAlchemy engine) created with
# whatever database connector you happen to be using
dask_df = dd.read_csv("your_csv_file.csv")
user_df = pd.read_sql("""
    SELECT user_id, username
    FROM user_table
    """, con=my_db_connection
)

# Assuming both dataframes use "user_id" as the column name;
# if not, use the left_on and right_on arguments
merged_df = dask_df.merge(user_df, how="left", on="user_id")
