I have a .csv file that is split into sections, each starting with < string > on a row of its own, as in the example below. Each section header is followed by a set of columns and their respective rows of values. Columns are not consistent between sections.
< section1 >
col1 col2 col3
val1 val2 val3
< section2 >
col3 col4 col5
val4 val5 val6
val7 val8 val9
...etc. Is there a way, whether the file is saved as .txt or .csv, to import each section either 1) into separate dataframes, or 2) into the same dataframe, accessed with something like df[section][col]?
Many thanks!
Depending on the size of your csv, you could read the entire file into Pandas and then split that dataframe into multiple dataframes via a list comprehension.
data = '''<Network>;;;;;;;;;;;;;;;;;;;;;
Property;Value;;;;;;;;;;;;;;;;;;;;
Title;;;;;;;;;;;;;;;;;;;;;
Version;6.4;;;;;;;;;;;;;;;;;;;;
;;;;;;;;;;;;;;;;;;;;;
<Sites>;;;;;;;;;;;;;;;;;;;;;
Name;LocationCode;Longitude;Latitude;;;;;;;;;;...'''
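import pandas as pd
from io import StringIO
# no sep is passed, so each line of this ';'-delimited sample is read as one string in column 0
df = pd.read_csv(StringIO(data), header=None)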
# create a list of dataframe names (the headers of each df)
df_names = df[0].str.extract(r'(<[a-zA-Z]+>)')[0].str.strip('<>').dropna().tolist()
# find the indices for the headers
regions = df.loc[df[0].str.contains(r'<[a-zA-Z]+')].index.tolist()
last_row = df.index[-1]
regions.append(last_row)
from more_itertools import windowed
# create (start, end) windows for each 'sub' dataframe
regions_window = list(windowed(regions, 2))
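# the function helps with some cleanup during the dataframe extraction
def some_cleanup(df):
    # use the section name from the chunk's first row as the column label
    df.columns = df.iloc[0].str.extract(r'(<[a-zA-Z]+>)')[0].str.strip('<>')
    # drop the section header row itself
    df = df.iloc[1:]
    return df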
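# extract the dataframes; .loc includes its end label, so stop one row early for
# every window except the last to leave out the next section's header row
M = [df.loc[start:(end if end == last_row else end - 1)].pipe(some_cleanup) for start, end in regions_window]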
# create a dict with the keys as the dataframe names
dataframe_dict = dict(zip(df_names, M))
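If you would rather not depend on more_itertools, a similar split can be done on the raw text before pandas sees it. Below is a minimal sketch, assuming the layout from the question (whitespace-separated columns, each section opened by a line like < section1 >). The names raw, header_re, split_sections and frames are made up for illustration, and the sample text is hypothetical.

```python
import re
from io import StringIO

import pandas as pd

# Hypothetical sample mirroring the layout described in the question
raw = """< section1 >
col1 col2 col3
val1 val2 val3
< section2 >
col3 col4 col5
val4 val5 val6
val7 val8 val9
"""

# Assumption: a section header is a line consisting only of "< name >"
header_re = re.compile(r"^<\s*(\w+)\s*>$")


def split_sections(text):
    """Group the lines of text under the section header that precedes them."""
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        match = header_re.match(line)
        if match:
            # start collecting lines for a new section
            current = match.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


# Parse each section on its own; its first line supplies the column names
frames = {
    name: pd.read_csv(StringIO("\n".join(lines)), sep=r"\s+")
    for name, lines in split_sections(raw).items()
}

print(frames["section2"]["col4"].tolist())  # ['val5', 'val8']
```

Since frames is a plain dict, each value is a separate dataframe (option 1), and frames[section][col] gives the access pattern from option 2. For a ';'-delimited file like the sample data above, you would pass sep=';' instead and loosen the header pattern so it tolerates the trailing delimiters.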