How to clean bad data from a huge CSV file

d_frEak :

So I have a huge CSV file (assume 5 GB) and I want to insert the data into a table, but it returns an error saying the length of the data is not the same.

I found that some rows have more columns than I want. For example, the correct data has 8 columns, but some rows have 9 (it could be a human/system error).

I want to keep only the rows with 8 columns, but because the data is so huge, I cannot do it manually or by parsing it in Python.

Any recommendation for a way to do it?

I am using Linux, so any Linux command is also welcome.

In SQL I am using the COPY ... FROM ... CSV HEADER; command to import the CSV into the table.
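
For reference, the import looks something like this (the table name and file path below are placeholders):

COPY my_table FROM '/data/input.csv' CSV HEADER;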

Romeo Ninov :

You can use awk for this purpose. Assuming your field delimiter is a comma (,), this code can do the work:

awk -F, 'NF==8 {print}' input_file > output_file
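
As a usage sketch (file, database, and table names below are placeholders), you could write the cleaned rows to a new file and then import it with psql's client-side \copy:

# keep only rows that have exactly 8 comma-separated fields
awk -F, 'NF==8' input.csv > cleaned.csv
# import the cleaned file into the target table
psql -d mydb -c "\copy my_table FROM 'cleaned.csv' CSV HEADER"

Note that awk with -F, counts fields on every comma, so this only works if no value contains an embedded, quoted comma; the header row is kept as long as it also has 8 columns, which it should if your table has 8 columns.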
