This is an experimental and opinionated reproducible workflow for working with Kaggle competitions. The Kaggle Flow always checks whether the competition rules have been accepted and whether the competition's data files are available locally. Any data files that are missing will be downloaded.
Find the competition you want to work on. The methods accept several forms of ID. The examples below use the titanic competition to show each accepted input:
- The competition URL: https://www.kaggle.com/c/titanic and https://www.kaggle.com/c/titanic/code are both recognized as the ID titanic. The ID is the text after https://www.kaggle.com/c/ and before the next forward slash, if it exists.
- The Kaggle API command: kaggle competitions download -c titanic is recognized as the ID titanic.
- The ID itself: titanic directly.

The flow will always check whether the user has accepted the competition's rules. If the rules have not been accepted, a prompt notifies the user of the error and offers to take the user to the competition's rules page.
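To make the accepted ID forms listed above concrete, here is a minimal base R sketch of how a URL, an API command, or a bare ID could each be reduced to the same competition ID. The helper `extract_competition_id()` is hypothetical and written only for illustration; it is not kaggler's actual implementation.

```r
# Hypothetical helper (not part of kaggler): reduce any accepted input form
# to a bare competition ID.
extract_competition_id <- function(x) {
  x <- trimws(x)
  if (grepl("kaggle\\.com/c/", x)) {
    # URL form: keep the text after "/c/" and before the next "/", if any
    sub("^.*kaggle\\.com/c/([^/?#]+).*$", "\\1", x, perl = TRUE)
  } else if (grepl("^kaggle competitions download", x)) {
    # API command form: keep the token that follows the -c flag
    sub("^.*-c\\s+(\\S+).*$", "\\1", x, perl = TRUE)
  } else {
    # Anything else is treated as a bare ID
    x
  }
}

inputs <- c(
  "https://www.kaggle.com/c/titanic",
  "https://www.kaggle.com/c/titanic/code",
  "kaggle competitions download -c titanic",
  "titanic"
)

# Each of the four inputs resolves to "titanic"
ids <- vapply(inputs, extract_competition_id, character(1), USE.NAMES = FALSE)
print(ids)
```

Normalizing the input once, up front, means the rest of a workflow only ever has to deal with the bare ID.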
library(kaggler)
kgl_flow(id = "tabular-playground-series-jun-2021")
#> x You must accept this competition's rules before you'll be able to download files.
#> Would you like to visit 'https://www.kaggle.com/c/tabular-playground-series-jun-2021/rules' to accept the rules?
#>
#> 1: Nope
#> 2: Yup
#> 3: No way
Now let's switch to a different project and a competition whose rules my account
has accepted. Running kgl_flow() will
download all the files I need and also store some metadata to keep track
of the competition ID and information about the competition's data
files.
kgl_flow("titanic")
#> • These files will be downloaded:
#> - 'gender_submission'
#> - 'test'
#> - 'train'.
#> • Downloading 'gender_submission.csv'...
#> • Downloading 'test.csv'...
#> • Downloading 'train.csv'...
The files have been saved into a new directory,
_kaggle_data.
fs::dir_ls("_kaggle_data/")
#> _kaggle_data/gender_submission.csv
#> _kaggle_data/meta
#> _kaggle_data/test.csv
#> _kaggle_data/train.csv
We can get some information about our competition data by looking at the metadata.
kgl_flow_meta()
#> ℹ Competition ID: 'titanic'
#> # A tibble: 3 x 10
#> id ref name description total_bytes url creation_date download_time nrows ncols
#> <chr> <chr> <chr> <chr> <int> <chr> <dttm> <dttm> <int> <int>
#> 1 titan… gender_… gender… "An example of what a sub… 3258 https:… 2018-04-09 05:33:22 2021-08-26 16:19:50 418 2
#> 2 titan… test.csv test.c… "test data to check the a… 28629 https:… 2018-04-09 05:33:22 2021-08-26 16:19:51 418 11
#> 3 titan… train.c… train.… "contains data " 61194 https:… 2018-04-09 05:33:22 2021-08-26 16:19:52 891 12
We can also look at the competition information that is returned by the Kaggle API.
kgl_flow_competition_info()
#> # A tibble: 1 × 23
#> id ref title url description organization_na… organization_ref category
#> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 3136 titanic Tita… http… Start here… Kaggle kaggle Getting…
#> # … with 15 more variables: reward <chr>, deadline <dttm>, kernel_count <int>,
#> # team_count <int>, user_has_entered <lgl>, user_rank <lgl>,
#> # merger_deadline <dttm>, new_entrant_deadline <dttm>, enabled_date <dttm>,
#> # max_daily_submissions <int>, max_team_size <lgl>, evaluation_metric <chr>,
#> # awards_points <lgl>, is_kernels_submissions_only <lgl>,
#> # submissions_disabled <lgl>
If all of the competition's data files are in CSV format, they can easily be loaded in.
kgl_flow_load()
#> ℹ Competition ID: 'titanic'
#> ✓ The data has been loaded into the global environment!
#> - 'gender_submission'
#> - 'test'
#> - 'train'
If one of the files gets accidentally deleted, kgl_flow_load() will reference the metadata to make sure all files are available before loading them in.
fs::file_delete("_kaggle_data/train.csv")
kgl_flow_load()
#> x There seem to be files missing! Run 'kgl_flow()' to make sure all files are present.
As prompted, we can run kgl_flow() again to get the
files back.
kgl_flow()
#> ℹ These files are detected in '_kaggle_data/' and will not be downloaded:
#> - 'gender_submission'
#> - 'test'
#> ● These files will be downloaded:
#> - 'train'.
#> ● Downloading 'train.csv'...
We did not need to supply the id this time because the
flow will check if an ID has been recorded in the metadata.
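The recovery steps above depend on comparing what the metadata says should exist with what is actually on disk. Here is a minimal base R sketch of that kind of check, assuming the expected file names are already known. The helper `missing_files()` and the temporary demo directory are hypothetical and only illustrate the idea; this is not how kaggler itself is implemented.

```r
# Hypothetical helper (not part of kaggler): report which expected files are
# absent from the data directory.
missing_files <- function(expected, dir = "_kaggle_data") {
  present <- list.files(dir)
  setdiff(expected, present)
}

# Work in a temporary directory so no real data is touched
demo_dir <- file.path(tempdir(), "kaggle_flow_demo")
dir.create(demo_dir, showWarnings = FALSE)

# Assumed file names, mirroring the titanic example
expected <- c("gender_submission.csv", "test.csv", "train.csv")
invisible(file.create(file.path(demo_dir, expected)))

# Simulate an accidental deletion, then check what needs to be re-downloaded
invisible(file.remove(file.path(demo_dir, "train.csv")))
to_download <- missing_files(expected, demo_dir)
print(to_download)
```

Only the files reported as missing would need to be fetched again, which matches the behaviour shown above, where 'gender_submission' and 'test' are detected and skipped.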
kgl_flow_leaderboard()
#> • Downloading leaderboard data for 'titanic'
#> ✓ Leaderboard Data Downloaded!
#>
#> # A tibble: 50,327 × 4
#> team_id team_name submission_date score
#> <dbl> <chr> <dttm> <dbl>
#> 1 2596702 Itaegyun 2021-07-12 08:15:47 1
#> 2 6650429 arduin 2021-07-12 12:00:53 1
#> 3 6931429 Nguyen Duc Tung (K… 2021-07-14 00:57:27 1
#> 4 6931547 Bach Nguyen #3 2021-07-14 05:27:44 1
#> 5 6931673 DDWanderer 2021-07-14 02:00:10 1
#> 6 6931524 he130230 2021-07-14 02:11:04 1
#> 7 6931497 (K12_HN) Pham Vu T… 2021-07-14 02:55:22 1
#> 8 6933163 HE130793 2021-07-14 04:21:24 1
#> 9 7006924 hebimonhp 2021-07-14 04:08:43 1
#> 10 6931521 Duc North 2021-07-14 04:45:48 1
#> # … with 50,317 more rows
This has been heavily influenced by the {targets} package. Any issues or ideas for improvements to this experimental flow are greatly appreciated!