Kaggle Flow

This is an experimental and opinionated reproducible workflow for working with Kaggle competitions. The Kaggle Flow will always check if the competition rules are accepted and the data files for the competition are readily available. If they are not, they will be downloaded.

ID input

Find the competition you want to work on. Methods have been built out to accept multiple forms of id. The follow examples below will be using the titanic competition to show the example inputs.

  1. The competition URL
    • https://www.kaggle.com/c/titanic and https://www.kaggle.com/c/titanic/code will recognize titanic as the ID
    • Anything after https://www.kaggle.com/c/ and before the next forward slash, if it exists
  2. Kaggle’s API command
    • Kaggles official API is built in python and they supply a command to download the data on the data tab of a competition. The download functions will take kaggle competitions download -c titanicand recognize the ID as titanic
  3. Explicit ID
    • If the above two cases don’t match in the logic then everything else will be considered an ID
    • Example; entering titanic directly

Example Work Flow

The flow will always check if the user has accepted the rules to the competition. If the rules have not been accepted, a prompt will be shown notifying the user of the error and an input to take the user to the competitions rules.

library(kaggler)

kgl_flow(id = "tabular-playground-series-jun-2021")

#> x You must accept this competition's rules before you'll be able to download files.
#> Would you like to visit 'https://www.kaggle.com/c/tabular-playground-series-jun-2021/rules' to  accept the rules?
#> 
#> 1: Nope
#> 2: Yup
#> 3: No way

Now lets switch to a different project and a competition my account has accepted the rules for. Running kgl_flow() will download all the files I need and also store some metadata to keep track of the competition ID and information about the competitions data files.

kgl_flow("titanic")

#> • These files will be downloaded:
#>   - 'gender_submission'
#>   - 'test'
#>   - 'train'.
#> • Downloading 'gender_submission.csv'...
#> • Downloading 'test.csv'...
#> • Downloading 'train.csv'...

The files have been saved into a new directory; _kaggle_data.

fs::dir_ls("_kaggle_data/")

#> _kaggle_data/gender_submission.csv
#> _kaggle_data/meta
#> _kaggle_data/test.csv
#> _kaggle_data/train.csv

We can get some information about our competition data by looking at the metadata.

kgl_flow_meta()

#> ℹ Competition ID: 'titanic'
#> # A tibble: 3 x 10
#>   id     ref      name    description                total_bytes url      creation_date       download_time       nrows ncols
#>   <chr>  <chr>    <chr>   <chr>                            <int> <chr>   <dttm>              <dttm>              <int> <int>
#> 1 titan… gender_… gender… "An example of what a sub…        3258 https:… 2018-04-09 05:33:22 2021-08-26 16:19:50   418     2
#> 2 titan… test.csv test.c… "test data to check the a…       28629 https:… 2018-04-09 05:33:22 2021-08-26 16:19:51   418    11
#> 3 titan… train.c… train.… "contains data "                 61194 https:… 2018-04-09 05:33:22 2021-08-26 16:19:52   891    12

We can also look at the competition information that is returned by the Kaggle API.

kgl_flow_competition_info()

#> # A tibble: 1 × 23
#>      id ref     title url   description organization_na… organization_ref category
#>   <int> <chr>   <chr> <chr> <chr>       <chr>            <chr>            <chr>   
#> 1  3136 titanic Tita… http… Start here… Kaggle           kaggle           Getting…
#> # … with 15 more variables: reward <chr>, deadline <dttm>, kernel_count <int>,
#> #   team_count <int>, user_has_entered <lgl>, user_rank <lgl>,
#> #   merger_deadline <dttm>, new_entrant_deadline <dttm>, enabled_date <dttm>,
#> #   max_daily_submissions <int>, max_team_size <lgl>, evaluation_metric <chr>,
#> #   awards_points <lgl>, is_kernels_submissions_only <lgl>,
#> #   submissions_disabled <lgl>

If the competitions data is all in csv format, then they can easily be loaded in.

kgl_flow_load()

#> ℹ Competition ID: 'titanic'
#> ✓ The data has been loaded into the global environment!
#>   - 'gender_submission'
#>   - 'test'
#>   - 'train'

In an unwanted situation where one of the files gets accidentily deleted, kgl_flow_load() will reference the metadata to make sure all files are available before loading them in.

fs::file_delete("_kaggle_data/train.csv")

kgl_flow_load()
#> x There seem to be files missing! Run 'kgl_flow()' to make sure all files are present.

As prompted, we can run kgl_flow() again to get the files back.

kgl_flow()

#> ℹ These files are detected in '_kaggle_data/' and will not be downloaded:
#>   - 'gender_submission'
#>   - 'test'
#> ● These files will be downloaded:
#>   - 'train'.
#> ● Downloading 'train.csv'...

We did not need to supply the id this time because the flow will check if an ID has been recorded in the metadata.

Submit to competition

Work in progress 👷

Leaderboard

kgl_flow_leaderboard()

#> • Downloading leaderboard data for 'titanic'
#> ✓ Leaderboard Data Downloaded!                                        #>                      
#> # A tibble: 50,327 × 4
#>    team_id team_name           submission_date     score
#>      <dbl> <chr>               <dttm>              <dbl>
#>  1 2596702 Itaegyun            2021-07-12 08:15:47     1
#>  2 6650429 arduin              2021-07-12 12:00:53     1
#>  3 6931429 Nguyen Duc Tung (K… 2021-07-14 00:57:27     1
#>  4 6931547 Bach Nguyen #3      2021-07-14 05:27:44     1
#>  5 6931673 DDWanderer          2021-07-14 02:00:10     1
#>  6 6931524 he130230            2021-07-14 02:11:04     1
#>  7 6931497 (K12_HN) Pham Vu T… 2021-07-14 02:55:22     1
#>  8 6933163 HE130793            2021-07-14 04:21:24     1
#>  9 7006924 hebimonhp           2021-07-14 04:08:43     1
#> 10 6931521 Duc North           2021-07-14 04:45:48     1
#> # … with 50,317 more rows

Special thanks

This has been heavily influenced by the {targets} package. Any issues or ideas for improvements to this experimental flow is greatly appreciated!