Git is a tool used for software development. It supports version control and allows multiple people to work on the same codebase in parallel. Most data teams use Git as part of their regular workflow. This lesson focuses on GitHub, a commonly used web platform for Git.
A repository (repo) is a directory hosted on GitHub with a log history detailing any content changes. Each log entry is called a commit, which is made up of the content changes, a user-created message summarizing these changes, and a unique identifier called a SHA.
Say that I find a typo in the linear regression lesson on this website. I correct the typo, stage (add) the changes, write a commit message (for instance, "correcting typo in linear regression") and push these changes to our GitHub repo.
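The repo log is updated with a new entry that has the SHA 6y7laec5c372b366f1c4e1f0a55947c718a81a9 (an illustrative value; real SHAs are 40 hexadecimal characters that Git computes as a hash of the commit's contents, not randomly generated IDs), the message "correcting typo in linear regression" and the updates I made to the lesson.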
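The stage, commit, and log cycle described above can be tried out locally. The following is a minimal sketch in a throwaway directory; the file name lesson.txt and the demo identity are assumptions made for illustration, not part of this website's codebase.

```shell
set -eu
# Work in a throwaway directory so no real repository is touched.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"            # hypothetical identity
git config user.email "demo@example.com"    # hypothetical identity

# Create a (hypothetical) lesson file containing a typo and commit it.
echo "Linear regresion fits a line to data." > lesson.txt
git add lesson.txt
git commit -q -m "adding linear regression lesson"

# Correct the typo, stage the change, and commit it with a message.
echo "Linear regression fits a line to data." > lesson.txt
git add lesson.txt
git commit -q -m "correcting typo in linear regression"

# Each log entry shows an abbreviated SHA followed by the commit message.
git log --oneline

# The full SHA of the latest commit (40 hex characters with Git's default SHA-1 format).
latest_sha="$(git rev-parse HEAD)"
echo "$latest_sha"
```

The SHA printed at the end will differ on every run, because it depends on the commit's contents, author, and timestamp.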
Local vs Remote
Let's use this website as an example again. The codebase is hosted on GitHub, but also on each contributor's device. The copies on our devices are the local repositories, while the one on GitHub is the remote repository.
If I make a change in my local repository, the website is not affected. I have to push the changes from local to remote in order to update the actual website.
Remote repositories are referenced by their names. The default name is origin, but that can also be customized.
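To make the local/remote split concrete, here is a small sketch that uses a bare repository on disk as a stand-in for the copy hosted on GitHub. The directory names, file name, and the alternative remote name github are all assumptions chosen for the example.

```shell
set -eu
work="$(mktemp -d)"
cd "$work"

# A bare repository stands in for the remote copy hosted on GitHub (assumption).
git init -q --bare remote.git

# Cloning creates a local repository whose remote is named "origin" by default.
git clone -q remote.git local 2>/dev/null
cd local
git config user.name "Demo User"            # hypothetical identity
git config user.email "demo@example.com"    # hypothetical identity

# List the remotes this local repository knows about and where they point.
git remote -v

# Commit locally: the remote has not received anything yet.
echo "draft" > notes.txt
git add notes.txt
git commit -q -m "adding notes"
refs_before="$(git -C ../remote.git for-each-ref | wc -l | tr -d ' ')"

# Push the current branch: now the remote has it too.
git push -q origin HEAD
refs_after="$(git -C ../remote.git for-each-ref | wc -l | tr -d ' ')"
echo "branches on remote before push: $refs_before, after push: $refs_after"

# The remote's name is only a local label and can be changed.
git remote rename origin github
git remote
```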
Feature and Master Branch
Branches allow multiple people to work on the website in parallel. Each branch should contain a separate feature. The master branch (the default branch created alongside the repo) should be protected and updated only after code changes have been thoroughly reviewed and tested. The code in our master branch is what gets deployed, so any change made to master is reflected on the website. "Master" is a commonly used naming convention (newer GitHub repositories default to "main"); depending on the setup, you can choose which branch is synced with the end product.
If I want to work on a new feature for the website (for instance, a new lesson), I take the current master branch and create a new feature branch from it (essentially a copy). I make the necessary changes and then open a formal proposal (a Pull Request) to merge these changes into master.
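The branching workflow can be sketched locally as follows. The branch name new-lesson and the file lessons.txt are hypothetical; the pull request itself happens on GitHub and is not shown here.

```shell
set -eu
repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master     # make sure the first branch is named master
git config user.name "Demo User"            # hypothetical identity
git config user.email "demo@example.com"    # hypothetical identity

# First commit on master.
echo "Lesson 1" > lessons.txt
git add lessons.txt
git commit -q -m "adding first lesson"

# Create a feature branch from the current master and switch to it.
git checkout -q -b new-lesson

# Work on the feature branch.
echo "Lesson 2" >> lessons.txt
git commit -q -a -m "adding second lesson"

# Commits that exist on the feature branch but not yet on master.
git log --oneline master..new-lesson

# master is untouched until the feature branch is merged into it.
git checkout -q master
cat lessons.txt
```

Switching back to master shows only the first lesson, which is why work on a feature branch cannot affect the deployed code before it is merged.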
Pull Requests / Code Reviews
Each pull request has a unique identifying number, a description of what was changed (and why the changes were necessary), and a visual diff of the changes. Merging the pull request folds the changes into the master branch.
In most cases, at least one approval from a fellow collaborator is required to merge the pull request. Reviewers can approve the pull request from the UI, request changes or leave comments.
Reviewers can also pull the changes from the feature branch to view them on their local device.
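As a rough sketch of how a reviewer might bring a feature branch onto their own device, the example below simulates an author and a reviewer with two clones of a bare repository standing in for GitHub. The directory names, the branch name new-lesson, and the file lessons.txt are assumptions for illustration.

```shell
set -eu
work="$(mktemp -d)"
cd "$work"

# A bare repository stands in for the GitHub remote (assumption).
git init -q --bare remote.git
git -C remote.git symbolic-ref HEAD refs/heads/master

# Author: commit to master, then push a feature branch.
git clone -q remote.git author 2>/dev/null
cd author
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo Author"          # hypothetical identity
git config user.email "author@example.com"  # hypothetical identity
echo "Lesson 1" > lessons.txt
git add lessons.txt
git commit -q -m "adding first lesson"
git push -q origin master
git checkout -q -b new-lesson
echo "Lesson 2" >> lessons.txt
git commit -q -a -m "adding second lesson"
git push -q origin new-lesson
cd ..

# Reviewer: fetch the latest branches and check out the feature branch locally.
git clone -q remote.git reviewer
cd reviewer
git fetch -q origin
git checkout -q -b new-lesson origin/new-lesson
cat lessons.txt
```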
You can also refer to the official GitHub glossary for any Git-related jargon.
For a more comprehensive list of Git commands, visit the official Git cheat sheet.
Check Status of Local Repository
git status
This shows which files were changed.
# get the change differences for all files
git diff
# get the change differences for a specific file
git diff [file name]
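A short sketch of what the status and diff commands report, run in a throwaway repository. The file names are hypothetical, and --short and --name-only are optional flags used here only to keep the output compact.

```shell
set -eu
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"            # hypothetical identity
git config user.email "demo@example.com"    # hypothetical identity

# Commit two (hypothetical) files as a starting point.
printf 'alpha\nbeta\n' > data.txt
echo "readme" > README.txt
git add .
git commit -q -m "initial files"

# Modify one tracked file and create one new, untracked file.
printf 'alpha\ngamma\n' > data.txt
echo "scratch" > scratch.txt

# One line per changed file: " M" marks a modified file, "??" an untracked one.
git status --short

# Line-by-line differences for a specific tracked file.
git --no-pager diff data.txt

# Only the names of tracked files with unstaged changes.
changed="$(git diff --name-only)"
echo "$changed"
```

Note that git diff does not list scratch.txt: untracked files appear in git status but have no earlier version to compare against.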
Stage Changed Files to be Committed
git add [name_of_files]
# add all changed files
git add .
# stage changes to tracked files only (files that were previously added)
git add -u
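The difference between git add -u and git add . is easiest to see with one tracked and one untracked file. This is a minimal sketch with hypothetical file names.

```shell
set -eu
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"            # hypothetical identity
git config user.email "demo@example.com"    # hypothetical identity

# One file is already tracked by Git.
echo "v1" > tracked.txt
git add tracked.txt
git commit -q -m "adding tracked file"

# Edit the tracked file and create a brand-new file.
echo "v2" >> tracked.txt
echo "new" > untracked.txt

# -u stages changes to tracked files only; the new file is left out.
git add -u
staged_u="$(git diff --cached --name-only)"
echo "staged after 'git add -u': $staged_u"

# "git add ." also stages new files under the current directory.
git add .
staged_all_count="$(git diff --cached --name-only | wc -l | tr -d ' ')"
echo "files staged after 'git add .': $staged_all_count"
```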
Commit Staged Changes
git commit -m "Enter your message here"
Reset (Undo) Committed Changes
# undo commits after the given commit, keeping their changes as unstaged edits
git reset [commit SHA]
# undo commits after the given commit and discard their changes
git reset --hard [commit SHA]
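The two forms of reset behave quite differently with respect to the files on disk. Here is a minimal sketch in a throwaway repository, assuming a hypothetical file.txt that goes through three commits.

```shell
set -eu
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"            # hypothetical identity
git config user.email "demo@example.com"    # hypothetical identity

# Three commits, each replacing the contents of a (hypothetical) file.
echo "v1" > file.txt
git add file.txt
git commit -q -m "first version"
first_sha="$(git rev-parse HEAD)"

echo "v2" > file.txt
git commit -q -a -m "second version"
second_sha="$(git rev-parse HEAD)"

echo "v3" > file.txt
git commit -q -a -m "third version"

# Plain reset: the third commit is removed from history,
# but its edits stay in the working directory as unstaged changes.
git reset -q "$second_sha"
after_plain="$(cat file.txt)"
echo "after plain reset: $after_plain"
git status --short

# Hard reset: history moves back AND the file contents are discarded.
git reset -q --hard "$first_sha"
after_hard="$(cat file.txt)"
echo "after hard reset: $after_hard"
```

Because --hard also throws away uncommitted edits to tracked files, it is worth checking git status before running it.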
# get changes from remote origin
git pull origin [branch name]
# push changes to remote origin
git push origin [branch name]
# create a new local branch
git checkout -b [name_of_branch]
# switch to another local branch
git checkout [name_of_branch]
Here is some additional reading that may be helpful: