Program: BasicDataCleansingAndAnalysisForPandas

Name: BasicDataCleansingAndAnalysisForPandas

Version: 58 (release)

Newest release version: This is the newest release version of this Program.

Newest release or development version: This is the newest version of this Program.

Creator: Florian_Dietz

This program is called by Do-basics-for-task-data-cleansing-and-analysis-for-pandas to perform basic data cleansing and analysis on task_data_cleansing_and_analysis_for_pandas.

This program also acts as the central control unit to solve task_data_cleansing_and_analysis_for_pandas.


Note: This program is part of the first real, practical example that was developed for Elody. It is deliberately overengineered to test boundaries. It is inefficient, but easily extensible by other contributors.


This program executes multiple times. Every time it runs, it adds a number of Options and option_to_modify_table Tags (or none) and applies any changes from previously defined option_to_modify_table Tags that have been accepted in the meantime. If no changes were defined or applied and this isn't the first time the program executes, it will mark the require_open_ended_work with !provide.

(This means that this program runs at least twice, and likely more often. While this is not efficient, the operations performed by this program are all fairly simple, so the runtime is acceptable for reasonably-sized pandas files. Additionally, the program avoids running expensive algorithms that have already been run before if it doesn't look like circumstances changed and they should be run again. Running the program multiple times makes it easier to integrate other programs with this program, because interdependencies are caught easily, simply because of redundancy.)
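
The repeated-execution behaviour described above can be sketched as a small state machine. This is an illustrative model only, not the program's source: RunState, run_once and the propose callback are hypothetical names, and the finished flag stands in for marking require_open_ended_work with !provide.

```python
from dataclasses import dataclass, field


@dataclass
class RunState:
    """Hypothetical stand-in for the state kept between executions."""
    runs: int = 0
    pending: list = field(default_factory=list)   # proposed, not yet accepted
    accepted: list = field(default_factory=list)  # accepted, not yet applied
    applied: list = field(default_factory=list)   # already applied to the table
    finished: bool = False                        # models the !provide marking


def run_once(state, propose):
    """One execution: apply accepted changes, propose new ones, maybe finish."""
    first_run = state.runs == 0
    state.runs += 1

    # Apply every change that was accepted since the previous execution.
    applied_now = list(state.accepted)
    state.applied.extend(applied_now)
    state.accepted.clear()

    # Define new proposals (possibly none).
    proposed_now = list(propose(state))
    state.pending.extend(proposed_now)

    # Finish only if nothing was defined or applied and this is not the first run.
    if not applied_now and not proposed_now and not first_run:
        state.finished = True
    return state
```

Even when the first execution proposes nothing, this model needs a second execution before it finishes, which matches the "at least twice" remark above.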

If the Tag restrict_analysis_types_for_data_exploration_pandas is set, execution becomes more complicated: more calls of this Program are necessary, although each call performs fewer calculations and the execution order is more structured and easier to understand.

theTask : A task_data_cleansing_and_analysis_for_pandas Tag.
theFile : the file to look at.


If you want to supplement this program to improve on task_data_cleansing_and_analysis_for_pandas:

Your program needs to run after the first and before the last execution of this program:
Wait until this program has run, which can be recognized by watching for a signal_to_run_advanced_data_cleansing_and_analysis_for_pandas Tag, then look at the result and generate your own option_to_modify_table and/or update modifiable_file.

If your program waits on an option_to_modify_table to be accepted before making its change, you need to make sure the Option to execute the program that makes the change runs at a confidence higher than 100. This is because the Option rerun-data-cleansing-and-analysis-manager, which triggers this program again, runs at confidence 100.

You can look at Do-advanced-data-cleansing-and-analysis-for-pandas and AdvancedDataCleansingAndAnalysisForPandas for an example of how to do this.
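
The confidence constraint can be illustrated with a toy ordering. This is not Elody's scheduler: it only assumes that Options with higher confidence execute first, which is what the requirement above implies. The function execution_order and the Option name apply-my-accepted-change are hypothetical.

```python
def execution_order(options):
    """Return Option names sorted so that higher confidence comes first.

    Assumption: Options run in descending order of confidence.
    """
    ranked = sorted(options, key=lambda option: option[1], reverse=True)
    return [name for name, _confidence in ranked]


options = [
    ("rerun-data-cleansing-and-analysis-manager", 100),  # value from the text above
    ("apply-my-accepted-change", 101),                   # hypothetical Option
]
order = execution_order(options)
```

With a confidence of 101, the hypothetical Option is ordered before the rerun, so its change is in place when this program executes again.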


Features of this program:

-If no header is set yet in the pandas file and the first line is all strings, the first line is extracted as a header
(This is a minor correction. It doesn't always work, but it's good enough until a dedicated program for fixing formatting errors is written.)

-It generates a number of Tags that give information about the table and its columns. This will happen every time it runs, so it is possible for it to overwrite info_ Tags that have been previously set. You can attach a please_stop_helping Tag with comment "BasicDataCleansingAndAnalysisForPandas" to an info_ Tag to prevent this program from overwriting it.

-Among the info_ Symbols set here, info_column_types is special: This very complex and important Tag is only set the first time the program runs, unless the latest info_column_types on a column has comment="TBD". This ensures that this program will set baseline values for info_column_types, but will not overwrite them if another program replaces the Tag with a more precise analysis. Also, datetimes and timedeltas are marked as 'discrete' if all entries are full days, otherwise 'continuous'.

-Performs the following table modifications using option_to_modify_table:
  -Transform a column to numeric using
    This has a confidence of 1.05 if there are no null values, 0.95 otherwise.
  -Transform a column to datetime using
    This has a confidence of 1.05 if there are no null values, 0.95 otherwise.
  -Transform a column to timedelta using
    This has a confidence of 0.4 if the column could also be turned into a numeric datatype instead, and 0.8 otherwise.
  -Mark an integer column as info_primary_key if its values are all consecutive integers.
    This has a confidence of 1.1 if the integers start counting at 0 or 1, and 0.9 otherwise.
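
The features above can be approximated with plain pandas. The following is a minimal sketch under stated assumptions, not the program's actual code: all function names are hypothetical, null entries are ignored when checking for full days, "could also be turned into a numeric datatype" is approximated with pd.to_numeric, and "consecutive" is taken to mean ascending by 1 in row order.

```python
import pandas as pd


def promote_header(df):
    """Use the first row as header if it consists only of strings.

    Assumes the frame was loaded without a header.
    """
    if len(df) == 0:
        return df
    first = df.iloc[0]
    if all(isinstance(value, str) for value in first):
        out = df.iloc[1:].reset_index(drop=True)
        out.columns = list(first)
        return out
    return df


def temporal_kind(series):
    """'discrete' if every non-null entry is a whole number of days."""
    values = series.dropna()  # assumption: nulls do not count
    if pd.api.types.is_datetime64_any_dtype(values):
        whole = values == values.dt.normalize()
    else:  # assumed to be a timedelta column
        whole = values == values.dt.floor("D")
    return "discrete" if whole.all() else "continuous"


def conversion_confidence(series, target):
    """Confidence values from the list above for a proposed conversion."""
    if target in ("numeric", "datetime"):
        return 1.05 if series.notna().all() else 0.95
    if target == "timedelta":
        # Assumption: numeric convertibility is tested with pd.to_numeric.
        as_numeric = pd.to_numeric(series.dropna(), errors="coerce")
        return 0.4 if as_numeric.notna().all() else 0.8
    raise ValueError(f"unknown target: {target}")


def primary_key_confidence(series):
    """Confidence for info_primary_key, or None if the rule does not apply.

    Assumes an integer column whose values ascend by 1 in row order.
    """
    values = series.tolist()
    if not values:
        return None
    start = values[0]
    if values != list(range(start, start + len(values))):
        return None
    return 1.1 if start in (0, 1) else 0.9
```

For example, a column holding 5, 6, 7 would be proposed as a primary key with confidence 0.9, while 1, 3, 4 would not be proposed at all.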

ID: 292

Created: July 7, 2019, 10:57 a.m.

Docker Image:

Source code: Run the following command in a terminal to download the source code: 'lod-tools download-program -f <destination_folder> --name "BasicDataCleansingAndAnalysisForPandas" --version 58'

All versions of this Program:

Version 58 (release)

Version 57 (release)

Version 56 (release)

Version 55 (release)

Version 54 (release)

Version 53 (release)

Version 52 (development)

Version 51 (development)

Version 50 (development)

Version 49 (development)

Version 48 (development)

Version 47 (development)

Version 46 (development)

Version 45 (development)

Version 44 (development)

Version 43 (development)

Version 42 (development)

Version 41 (development)

Version 40 (development)

Version 39 (development)

Version 38 (development)

Version 37 (development)

Version 36 (development)

Version 35 (development)

Version 34 (development)

Version 33 (development)

Version 32 (development)

Version 31 (development)

Version 30 (development)

Version 29 (development)

Version 28 (development)

Version 27 (development)

Version 26 (release)

Version 25 (development)

Version 24 (development)

Version 23 (development)

Version 22 (development)

Version 21 (development)

Version 20 (development)

Version 19 (development)

Version 18 (development)

Version 17 (development)

Version 16 (development)

Version 15 (development)

Version 14 (development)

Version 13 (development)

Version 12 (development)

Version 11 (development)

Version 10 (release)

Version 9 (release)

Version 8 (development)

Version 7 (development)

Version 6 (development)

Version 5 (development)

Version 4 (development)

Version 3 (development)

Version 2 (development)

Version 1 (development)