This is an experimental and opinionated reproducible workflow for working with Kaggle competitions. The Kaggle Flow always checks that the competition rules have been accepted and that the competition's data files are available locally. Any files that are missing will be downloaded.
Find the competition you want to work on. Methods have been built out to accept multiple forms of id. The examples below use the titanic competition to show the accepted inputs.
- A competition URL: https://www.kaggle.com/c/titanic and https://www.kaggle.com/c/titanic/code will both recognize titanic as the ID, which is the text after https://www.kaggle.com/c/ and before the next forward slash, if it exists
- A Kaggle API command: kaggle competitions download -c titanic will recognize the ID as titanic
- The ID itself: titanic directly
The flow will always check whether the user has accepted the rules of the competition. If the rules have not been accepted, a prompt notifies the user of the error and offers to take the user to the competition's rules page.
library(kaggler)
kgl_flow(id = "tabular-playground-series-jun-2021")
#> x You must accept this competition's rules before you'll be able to download files.
#> Would you like to visit 'https://www.kaggle.com/c/tabular-playground-series-jun-2021/rules' to accept the rules?
#>
#> 1: Nope
#> 2: Yup
#> 3: No way
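To make the accepted id forms described earlier more concrete, here is a minimal sketch of how such inputs could be reduced to a bare competition ID. The function normalize_competition_id() is a hypothetical helper written for illustration only; it is not part of kaggler, and the package's actual parsing may differ.

```r
# Hypothetical helper (an assumption, not kaggler's API): reduce a URL,
# a Kaggle CLI download command, or a bare ID to the competition ID.
normalize_competition_id <- function(x) {
  x <- trimws(x)
  if (grepl("kaggle\\.com/c/", x)) {
    # URL form: keep the text after "/c/" and before the next slash, if any.
    x <- sub("^.*kaggle\\.com/c/([^/?#]+).*$", "\\1", x)
  } else if (grepl("^kaggle competitions download", x)) {
    # CLI form: keep the token that follows the -c flag.
    x <- sub("^.*[[:space:]]-c[[:space:]]+([^[:space:]]+).*$", "\\1", x)
  }
  # Anything else is assumed to already be a bare ID.
  x
}

inputs <- c(
  "https://www.kaggle.com/c/titanic",
  "https://www.kaggle.com/c/titanic/code",
  "kaggle competitions download -c titanic",
  "titanic"
)
ids <- vapply(inputs, normalize_competition_id, character(1), USE.NAMES = FALSE)
```

Under these assumptions, all four inputs resolve to the same ID, so any of them could be passed where an id is expected.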
Now let's switch to a different project and a competition my account has accepted the rules for. Running kgl_flow() will download all the files I need and also store some metadata to keep track of the competition ID and information about the competition's data files.
kgl_flow("titanic")
#> • These files will be downloaded:
#> - 'gender_submission'
#> - 'test'
#> - 'train'.
#> • Downloading 'gender_submission.csv'...
#> • Downloading 'test.csv'...
#> • Downloading 'train.csv'...
The files have been saved into a new directory, _kaggle_data.
fs::dir_ls("_kaggle_data/")
#> _kaggle_data/gender_submission.csv
#> _kaggle_data/meta
#> _kaggle_data/test.csv
#> _kaggle_data/train.csv
We can get some information about our competition data by looking at the metadata.
kgl_flow_meta()
#> ℹ Competition ID: 'titanic'
#> # A tibble: 3 x 10
#> id ref name description total_bytes url creation_date download_time nrows ncols
#> <chr> <chr> <chr> <chr> <int> <chr> <dttm> <dttm> <int> <int>
#> 1 titan… gender_… gender… "An example of what a sub… 3258 https:… 2018-04-09 05:33:22 2021-08-26 16:19:50 418 2
#> 2 titan… test.csv test.c… "test data to check the a… 28629 https:… 2018-04-09 05:33:22 2021-08-26 16:19:51 418 11
#> 3 titan… train.c… train.… "contains data " 61194 https:… 2018-04-09 05:33:22 2021-08-26 16:19:52 891 12
We can also look at the competition information that is returned by the Kaggle API.
kgl_flow_competition_info()
#> # A tibble: 1 × 23
#> id ref title url description organization_na… organization_ref category
#> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 3136 titanic Tita… http… Start here… Kaggle kaggle Getting…
#> # … with 15 more variables: reward <chr>, deadline <dttm>, kernel_count <int>,
#> # team_count <int>, user_has_entered <lgl>, user_rank <lgl>,
#> # merger_deadline <dttm>, new_entrant_deadline <dttm>, enabled_date <dttm>,
#> # max_daily_submissions <int>, max_team_size <lgl>, evaluation_metric <chr>,
#> # awards_points <lgl>, is_kernels_submissions_only <lgl>,
#> # submissions_disabled <lgl>
If the competition's data is all in CSV format, it can easily be loaded in.
kgl_flow_load()
#> ℹ Competition ID: 'titanic'
#> ✓ The data has been loaded into the global environment!
#> - 'gender_submission'
#> - 'test'
#> - 'train'
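The loading step above can be pictured with a small base R sketch: every CSV in a directory is read and assigned to an object named after the file. The helper load_csvs() and the demo directory are assumptions made for illustration; the source does not show how kgl_flow_load() is implemented.

```r
# Hypothetical helper (not kaggler's implementation): read every CSV in
# `dir` into `envir`, creating one object per file named after the file.
load_csvs <- function(dir, envir = globalenv()) {
  paths <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  object_names <- tools::file_path_sans_ext(basename(paths))
  for (i in seq_along(paths)) {
    assign(object_names[i], utils::read.csv(paths[i]), envir = envir)
  }
  invisible(object_names)
}

# Throwaway demo data so the sketch runs without a Kaggle account.
demo_dir <- file.path(tempdir(), "kaggle_load_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:3, y = c(0, 1, 0)),
          file.path(demo_dir, "train.csv"), row.names = FALSE)
write.csv(data.frame(id = 4:5),
          file.path(demo_dir, "test.csv"), row.names = FALSE)

# Load into a separate environment here to avoid touching the workspace.
demo_env <- new.env()
loaded <- load_csvs(demo_dir, envir = demo_env)
```

The sketch defaults to the global environment to mirror the message shown above, but passing a dedicated environment, as in the usage lines, keeps the demo from overwriting existing objects.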
If one of the files is accidentally deleted, kgl_flow_load() will reference the metadata to make sure all files are available before loading them in.
fs::file_delete("_kaggle_data/train.csv")
kgl_flow_load()
#> x There seem to be files missing! Run 'kgl_flow()' to make sure all files are present.
As prompted, we can run kgl_flow() again to get the files back.
kgl_flow()
#> ℹ These files are detected in '_kaggle_data/' and will not be downloaded:
#> - 'gender_submission'
#> - 'test'
#> ● These files will be downloaded:
#> - 'train'.
#> ● Downloading 'train.csv'...
We did not need to supply the id this time because the flow will check if an ID has been recorded in the metadata.
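The missing-file check and the "download only what is absent" behaviour shown above amount to a set difference between the file names recorded in the metadata and the files on disk. The following is a rough sketch of that idea; missing_files() and the demo directory are hypothetical and are not taken from kaggler's source.

```r
# Hypothetical helper (an assumption for illustration): file names that the
# metadata expects but that are not present in `dir`.
missing_files <- function(expected, dir) {
  setdiff(expected, list.files(dir))
}

# Recreate the situation from the example with empty placeholder files.
demo_dir <- file.path(tempdir(), "kaggle_missing_demo")
dir.create(demo_dir, showWarnings = FALSE)
expected <- c("gender_submission.csv", "test.csv", "train.csv")
file.create(file.path(demo_dir, expected))
unlink(file.path(demo_dir, "train.csv"))  # simulate the accidental deletion

to_download <- missing_files(expected, demo_dir)      # files to fetch again
already_present <- setdiff(expected, to_download)     # files to leave alone
```

A non-empty to_download would correspond to the "files missing" message, and a second run would fetch only those entries.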
kgl_flow_leaderboard()
#> • Downloading leaderboard data for 'titanic'
#> ✓ Leaderboard Data Downloaded!
#>
#> # A tibble: 50,327 × 4
#> team_id team_name submission_date score
#> <dbl> <chr> <dttm> <dbl>
#> 1 2596702 Itaegyun 2021-07-12 08:15:47 1
#> 2 6650429 arduin 2021-07-12 12:00:53 1
#> 3 6931429 Nguyen Duc Tung (K… 2021-07-14 00:57:27 1
#> 4 6931547 Bach Nguyen #3 2021-07-14 05:27:44 1
#> 5 6931673 DDWanderer 2021-07-14 02:00:10 1
#> 6 6931524 he130230 2021-07-14 02:11:04 1
#> 7 6931497 (K12_HN) Pham Vu T… 2021-07-14 02:55:22 1
#> 8 6933163 HE130793 2021-07-14 04:21:24 1
#> 9 7006924 hebimonhp 2021-07-14 04:08:43 1
#> 10 6931521 Duc North 2021-07-14 04:45:48 1
#> # … with 50,317 more rows
This has been heavily influenced by the {targets} package. Any issues or ideas for improvements to this experimental flow are greatly appreciated!