

Recently, Conor O'Sullivan wrote a great article on Batch Processing 22GB of Transaction Data with Pandas that discusses "[h]ow you get around limited computational resources and work with large datasets." His data set is a single 22GB CSV file, which can be found on Kaggle.

Conquer large datasets with Pandas in Python!

You can also find Conor's notebook and the Deephaven example on GitHub.

Using pandas with limited resources, Conor noted that aggregations took about 50 minutes each.

In this post, I'll show you how to take that same example, replace pandas with Deephaven, and, still with limited resources, speed things up as much as possible.

With this code, single aggregations take less than one minute.

With the pandas code, runtime was over 50 minutes. That's an astounding time reduction.

Here are the actual times I got on my ordinary laptop:

  • Read Parquet directory: 1.0 second.
  • Deephaven expense time: 55.9 seconds.
  • Deephaven agg expense time: 6.1 seconds.
  • Deephaven monthly expense time: 152.9 seconds.

Note that the last one is actually several aggregations.

The first issue with a dataset this large is loading it into Python to work with.

pandas tries to load the entire data set into memory. With limited resources, this is not possible and causes the kernel to die.

Deephaven approaches CSV files differently. For more details, see our blog post on designing our CSV reader.

I always think it's important to use the right tool for the job.

If you want to do some processing on a large CSV file, the best option is to read the file in chunks, process them one by one, and save the output to disk (using pandas, for example).
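As a point of reference, here is a minimal sketch of that chunked pandas approach. The CSV path is a placeholder, and I'm assuming the YEAR and AMOUNT columns used elsewhere in this post; adjust both to match your copy of the dataset.

import pandas as pd

csv_path = "/data/transactions.csv"  # placeholder path to the 22GB CSV

# Read the CSV in ~5M-row chunks, aggregate each chunk, then combine the partial results.
partials = []
for chunk in pd.read_csv(csv_path, chunksize=5_000_000):
    partials.append(chunk.groupby("YEAR")["AMOUNT"].sum())

yearly_totals = pd.concat(partials).groupby(level=0).sum()
print(yearly_totals)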

In this case, the data comes in as a CSV, but a better format is actually Parquet. I read in the data and wrote each step out as a Parquet file. This means I can come back and read in the Parquet files rather than the CSV.

To launch the latest release, you can clone the repository via:

git clone https://github.com/deephaven-examples/processing-large-csv-data.git
cd processing-large-csv-data
docker-compose up

This code and/or script is meant to work inside the current Deephaven IDE.

Please see our Quickstart if there are any problems or reach out on Slack.

Reading in the CSV file took about 50 minutes, even with Deephaven. Reading in the Parquet file took less than a tenth of a second.

The Parquet format of the data can be found on Kaggle:

kaggle datasets download -d amandamartin62/simulated-transactions-parquet-format

To read in the Parquet file, place that file in the data directory and execute inside Deephaven:

from deephaven import parquet

table = parquet.read("/data/transaction.parquet")
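If you want to reproduce the read timing on your own machine, you can wrap the call in the same kind of timer used throughout this post; a quick sketch:

import time
from deephaven import parquet

start = time.time()
table = parquet.read("/data/transaction.parquet")
print("Read Parquet time: " + str(time.time() - start) + " seconds. Rows: " + str(table.size))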

If you want to translate the large CSV into smaller Parquet files, use this code.


The timing steps show you how long things take:

from deephaven import read_csv
from deephaven import parquet
import time

# Path to the 22GB CSV (placeholder name; point this at your Kaggle download).
file = "/data/transactions.csv"

steps = 5000000
count = 0
while True:
    i = count
    start = time.time()
    table = read_csv(file, skip_rows=i * steps, num_rows=steps,
                     allow_missing_columns=True, ignore_excess_columns=True)
    parquet.write(table, f"/data/transaction_parquet/{i}.parquet")
    end = time.time()
    print("read " + str(table.size) + " in " + str(end - start) + " seconds." + " iteration number ", i)
    count += 1
    # Exit the loop once a short (final) chunk has been read.
    if table.size != steps:
        break
    del table
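Once the chunks are written, you can load them back as a single table. Deephaven's Parquet reader can also handle a directory of same-schema files, but to keep things explicit, here is a sketch that reads each chunk from the directory used above and merges them:

import os
from deephaven import parquet, merge

chunk_dir = "/data/transaction_parquet"

# Read every Parquet chunk written by the loop above and merge them into one table.
chunk_tables = [
    parquet.read(os.path.join(chunk_dir, name))
    for name in sorted(os.listdir(chunk_dir))
    if name.endswith(".parquet")
]
table = merge(chunk_tables)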
  • When you run a pandas aggregation, as Conor O'Sullivan's article notes, it takes about 50 minutes.
  • On my laptop, this was actually closer to 90 minutes.
  • With the Deephaven aggregation, the time was reduced to less than 30 seconds.

    Deephaven is engineered for large data.

The time improvement is nice, but I also like that we don't need to do any special batching. It just works with built-in functions.

Here are two different ways to sum up the total expenditures per year. You can see that the results match the original article:

from deephaven.plot.figure import Figure
from deephaven import agg as agg


def dh_sum_by_expends(table):
    # NOTE: this function is called below but was missing from the original snippet;
    # this is an assumed reconstruction that sums expenditures per year with sum_by.
    start = time.time()
    data_table = table.view(["YEAR", "AMOUNT"]).sum_by(["YEAR"]).sort(order_by=["YEAR"])
    end = time.time()
    print("Deephaven expense time: " + str(end - start) + " seconds.")
    return data_table


def dh_agg_expends(table):
    start = time.time()
    data_table = table.agg_by([agg.sum_(cols=["AMOUNT = AMOUNT"]),
                               agg.count_(col="count")], by=["YEAR"]).sort(order_by=["YEAR"])
    end = time.time()
    print("Deephaven agg expense time: " + str(end - start) + " seconds.")
    return data_table


def dh_sum_by_expends_monthly(table):
    start = time.time()
    data_table = table.where(["YEAR == 2020", "EXP_TYPE = `Entertainment`"]).agg_by([
        agg.sum_(["AMOUNT = AMOUNT"])], by=["MONTH"]).sort(order_by=["MONTH"])
    end = time.time()
    print("Deephaven monthly expense time: " + str(end - start) + " seconds.")
    return data_table


deephaven_expense_table_sum = dh_sum_by_expends(table)
deephaven_expense_table_agg = dh_agg_expends(table)

figure = Figure()
plot_expenses_sum = figure.plot_xy(series_name="expense", t=deephaven_expense_table_sum,
                                   x="YEAR", y="AMOUNT").show()
plot_expenses_agg = figure.plot_xy(series_name="expense", t=deephaven_expense_table_agg,
                                   x="YEAR", y="AMOUNT").show()

More advanced operations can be done directly, as shown here:

def dh_sum_by_monthly(table):
    start = time.time()
    data_table = table.where(["YEAR == 2020", "EXP_TYPE = `Entertainment`"])\
        .agg_by([agg.sum_(cols=["AMOUNT"])], by=["CUST_ID", "MONTH"])\
        .drop_columns(cols=["CUST_ID"])\
        .avg_by(["MONTH"])\
        .sort(order_by=["MONTH"])
    end = time.time()
    print("Deephaven sum_by monthly time: " + str(end - start) + " seconds.")
    return data_table


deephaven_sum_by_monthly = dh_sum_by_monthly(table)

plot_dh_sum_by_monthly = figure.plot_xy(series_name="expense", t=deephaven_sum_by_monthly,
                                        x="MONTH", y="AMOUNT").show()

The code looks more complicated than a typical query because we've wrapped every method in time tests to show the speed of Deephaven.
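If you'd rather keep the queries readable, one option is to pull the timing into a small decorator. This is a hypothetical helper I'm adding for illustration; the `timed` and `dh_sum_by_monthly_clean` names are not from the original example.

import time
from functools import wraps
from deephaven import agg

def timed(label):
    # Wrap a query function and print how long it took, keeping the query body clean.
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = func(*args, **kwargs)
            print(label + ": " + str(time.time() - start) + " seconds.")
            return result
        return wrapper
    return decorator

@timed("Deephaven sum_by monthly time")
def dh_sum_by_monthly_clean(table):
    return (table.where(["YEAR == 2020", "EXP_TYPE = `Entertainment`"])
            .agg_by([agg.sum_(cols=["AMOUNT"])], by=["CUST_ID", "MONTH"])
            .drop_columns(cols=["CUST_ID"])
            .avg_by(["MONTH"])
            .sort(order_by=["MONTH"]))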


Comment out whichever operations you don't want to run to see how each one performs.

Let us know how your query does on Slack.

There are a lot of options with datasets this large. Time should never be a limiting factor in the data science we can do.

The code in this repository fryst vatten built for Deephaven Community Core v0.19.1. No guarantee of forwards or backwards compatibility fryst vatten given.