Datasets¶
Our dataset is based on the webcompat GitHub issues submitted to webcompat/web-bugs. To build the dataset we used the GitHub API, specifically the following endpoints:
Datasets are distributed in CSV format, in the same repository as the rest of the codebase, as Git-LFS objects under /datasets.
Available datasets¶
All events/issues¶
This dataset includes all the issue-related historic data available in GitHub. It combines the issue data available from the API with all the events of each issue. This allows us to go through the lifecycle of each issue and extract features.
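As a rough illustration of what "going through the lifecycle" can look like, the sketch below attaches each issue's events to the issue in chronological order. The flat `issue_number` key on events and the helper name `attach_events` are assumptions made for this example, not the layout of the published CSV.

```python
from collections import defaultdict


def attach_events(issues, events):
    """Return copies of the issues with a chronologically sorted "events" list."""
    # Group events by the issue they belong to ("issue_number" is a hypothetical key).
    by_issue = defaultdict(list)
    for event in events:
        by_issue[event["issue_number"]].append(event)

    combined = []
    for issue in issues:
        # ISO 8601 timestamps in the same format and timezone sort correctly as strings.
        timeline = sorted(
            by_issue.get(issue["number"], []),
            key=lambda e: e["created_at"],
        )
        combined.append({**issue, "events": timeline})
    return combined


# Made-up sample data, only to show the shape of the result.
issues = [{"number": 1, "title": "Site is broken", "state": "closed"}]
events = [
    {"issue_number": 1, "event": "closed", "created_at": "2020-01-03T00:00:00Z"},
    {"issue_number": 1, "event": "milestoned", "created_at": "2020-01-02T00:00:00Z"},
]
lifecycle = attach_events(issues, events)
```

With the events ordered per issue, features can be derived by walking each timeline from the first event to the last.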
Needs diagnosis¶
This dataset is used to train the needsdiagnosis model. Starting from All events/issues, we select all the closed issues and extract the needsdiagnosis feature from their events. An issue is marked as needsdiagnosis if, at any point in its lifecycle (its issue events), it reached the needsdiagnosis milestone.
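The labelling rule described here could be sketched as follows. It assumes events shaped like GitHub's issue events, where a `milestoned` event carries a `milestone` object with a `title`; the function names and the nested `events` list on each issue are hypothetical and chosen for this example only.

```python
def reached_needsdiagnosis(events):
    """True if any event moved the issue into the needsdiagnosis milestone."""
    return any(
        event.get("event") == "milestoned"
        and (event.get("milestone") or {}).get("title") == "needsdiagnosis"
        for event in events
    )


def label_closed_issues(issues):
    """Keep only closed issues and attach the boolean needsdiagnosis feature."""
    rows = []
    for issue in issues:
        if issue["state"] != "closed":
            continue  # open issues have not finished their lifecycle yet
        rows.append(
            {
                "number": issue["number"],
                "title": issue["title"],
                "needsdiagnosis": reached_needsdiagnosis(issue.get("events", [])),
            }
        )
    return rows


# Made-up sample issues for illustration.
sample = [
    {
        "number": 1,
        "title": "Video does not play",
        "state": "closed",
        "events": [
            {"event": "milestoned", "milestone": {"title": "needstriage"}},
            {"event": "milestoned", "milestone": {"title": "needsdiagnosis"}},
            {"event": "closed"},
        ],
    },
    {
        "number": 2,
        "title": "Duplicate report",
        "state": "closed",
        "events": [{"event": "closed"}],
    },
    {"number": 3, "title": "Still open", "state": "open", "events": []},
]
labelled = label_closed_issues(sample)
```

Note that an issue counts as needsdiagnosis even if it later moved to another milestone, since the rule only asks whether the milestone was ever reached.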
Because of the nature of the webcompat issues, the data is unbalanced, with many more data points for needsdiagnosis = False. To account for this in our model we tried two approaches: one uses all the data we have, and the other balances the data points so that we have the same number of needsdiagnosis = True and needsdiagnosis = False entries.
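The text does not say how the balanced variant was produced. One common way to get equal class counts is to randomly downsample the larger class, which is what the sketch below does; the technique, the `balance` helper, and the fixed seed are all assumptions for illustration.

```python
import random


def balance(rows, key="needsdiagnosis", seed=0):
    """Downsample the larger class so both classes have the same number of rows."""
    positives = [row for row in rows if row[key]]
    negatives = [row for row in rows if not row[key]]
    size = min(len(positives), len(negatives))

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = rng.sample(positives, size) + rng.sample(negatives, size)
    rng.shuffle(balanced)  # avoid all positives coming first
    return balanced


# Made-up unbalanced data: 2 positive rows, 8 negative rows.
rows = [{"number": i, "needsdiagnosis": i < 2} for i in range(10)]
balanced_rows = balance(rows)
```

Downsampling discards rows from the majority class, so the balanced dataset is smaller than the full one.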
While going through the data, we noticed an inflation in our metrics that looked suspicious. It turned out that some titles get changed as part of the triaging process, and the edited titles hinted at the needsdiagnosis target. To fix that, we went through the events and extracted the original titles.
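A minimal sketch of recovering the original title, assuming events shaped like GitHub's issue events, where a `renamed` event carries a `rename` object with `from` and `to` fields. The `original_title` helper and the nested `events` list are hypothetical names used only for this example.

```python
def original_title(issue):
    """Return the title the issue was first submitted with."""
    renames = sorted(
        (e for e in issue.get("events", []) if e.get("event") == "renamed"),
        key=lambda e: e["created_at"],
    )
    if renames:
        # The earliest rename records the title as it was before any edit.
        return renames[0]["rename"]["from"]
    # Never renamed: the current title is the original one.
    return issue["title"]


# Made-up issue whose title was edited twice during triage.
renamed_issue = {
    "title": "example.com - video does not play in Firefox",
    "events": [
        {
            "event": "renamed",
            "created_at": "2020-01-03T00:00:00Z",
            "rename": {
                "from": "example.com - video broken",
                "to": "example.com - video does not play in Firefox",
            },
        },
        {
            "event": "renamed",
            "created_at": "2020-01-02T00:00:00Z",
            "rename": {
                "from": "example.com - site is not usable",
                "to": "example.com - video broken",
            },
        },
    ],
}
untouched_issue = {"title": "example.org - see bug description", "events": []}
```

Using the original title means the model only sees text that was available when the issue was filed, before triage could leak information about the target.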
The ML model's metrics were more balanced with the balanced dataset, but since we are trying to tackle the problem of having too much noise in our input, we currently use datasets/needsdiagnosis-full-original-titles.csv.