Git For the Modern Data Scientist: 9 Git Concepts You Can’t Ignore

3. Staging area

By talking about commits, we’ve gotten ahead of ourselves. Before closing the lid of the commit capsule, you have to make sure the contents inside are right.

This involves telling Git exactly which changes from which files you want to commit. Sometimes, new changes come from several files, and you may want to commit only a few of them and leave the rest for future commits.

This is where we lift the curtains and reveal the staging area (pun intended):

Image by me. The staging area is modified after the changes in train.py are added.

The idea is that you should have a way of double-checking, editing, or undoing the changes you want to add to your Git history before you press that commit button.

Adding the new changes to the staging area (or the Git index, as some call it) lets you do this. The area holds the changes you want to include in the next commit.

Let’s say you modified both clean.py and train.py. If you add the changes in train.py to the staging area with git add train.py, the next commit will only include that change.

The modified clean.py will stay as is (uncommitted).

Image by me. The image above, shown again for clarity.

So, here is a simple workflow for you:

  1. Track new files with Git using git add new_file.extension (only done once)
  2. Add changes in tracked files to the staging area with git add changed_file.extension
  3. Commit the changes in the staging area to history with git commit -m "Commit message".
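
To see the workflow end to end, here is a minimal shell sketch you can run in a throwaway directory. The temporary repository, the file contents, and the "Demo User" identity are placeholders made up for illustration; clean.py and train.py are the two files from the example above.

```shell
set -e
repo=$(mktemp -d)                          # throwaway sandbox repository
cd "$repo"
git init -q
git config user.name "Demo User"           # placeholder identity, local to this repo
git config user.email "demo@example.com"

# 1. Track the new files and record a first commit.
echo "print('clean v1')" > clean.py
echo "print('train v1')" > train.py
git add clean.py train.py
git commit -q -m "Add clean.py and train.py"

# 2. Modify both files, but stage only train.py.
echo "print('clean v2')" > clean.py
echo "print('train v2')" > train.py
git add train.py

staged=$(git diff --cached --name-only)    # changes queued for the next commit
unstaged=$(git diff --name-only)           # changes left out of it
echo "staged: $staged"
echo "unstaged: $unstaged"

# 3. Commit only what is in the staging area.
git commit -q -m "Update training script"
git status --short                         # clean.py is still listed as modified
```

After the last commit, clean.py keeps its uncommitted edit, ready to go into a later commit of its own.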

4. Hashes and tags

Apart from messages, all Git commits have hashes so you can point to them more easily.

Image by me. Three sample commits with 7-character hashes.

A hash is a string of 40 hexadecimal characters (a SHA-1 digest by default) that gives each commit a unique ID, like 1a3b5c7d9e2f4a6b8c0d1e2f3a4b5c6d7e8f9a0b.

Hashes make switching between commits (different versions of your code base) much easier with git checkout HASH. You don’t have to write out the full hash when switching. The first few characters, as long as they identify the commit uniquely, are enough.

You can list all the commits you’ve made with their hashes using git log (this also shows the author, date, and message of each commit).

To list only the hash and the message without cluttering up your screen, you can use git log --oneline.

Image by me. The command to list your Git log line-by-line.
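
Here is a small sketch of hashes in practice. It assumes a sandbox repository with three made-up commits to train.py; the identity and file contents are placeholders. HEAD~1 simply means "the commit before the latest one" and is only used here to grab a hash to play with.

```shell
set -e
repo=$(mktemp -d)                            # throwaway sandbox repository
cd "$repo"
git init -q
git config user.name "Demo User"             # placeholder identity
git config user.email "demo@example.com"

# Create three versions of the same file, one commit each.
for i in 1 2 3; do
    echo "version $i" > train.py
    git add train.py
    git commit -q -m "Commit $i"
done

full_hash=$(git rev-parse HEAD~1)            # full 40-character hash of the second commit
short_hash=$(git rev-parse --short HEAD~1)   # abbreviated form of the same hash
echo "full:  $full_hash"
echo "short: $short_hash"
git log --oneline                            # one line per commit: short hash + message

# The abbreviated hash is enough to switch to that version.
git checkout -q "$short_hash"
cat train.py                                 # contents as of the second commit
```

The hashes themselves will differ on every run, since they depend on the commit's contents, author, and timestamp.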

If hashes intimidate you, there are also Git tags. A Git tag is a friendly nickname you can give to important commits (or any commit) so you can remember and refer to them more easily.

Image by me. 4 commits with two of them tagged.

You can use the git tag command to assign tags to specific commits that are important, such as those containing a major feature or a big code base release (e.g., v1.0.0). You can also tag a commit that represents your best model, such as random_forest_best.

Think of tags as little human-readable milestones that stand out among all the commit hashes.

To clarify, the command git tag tag_name will only add a tag to the latest commit. If you want to add a tag to a specific commit, you need to specify the commit hash at the end of the command, after the tag name.
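
Both forms of git tag can be sketched like this. The three commits and their messages are invented for the demo, and the tag names are the two examples mentioned above.

```shell
set -e
repo=$(mktemp -d)                          # throwaway sandbox repository
cd "$repo"
git init -q
git config user.name "Demo User"           # placeholder identity
git config user.email "demo@example.com"

for model in baseline random_forest sgd; do
    echo "model = '$model'" > train.py
    git add train.py
    git commit -q -m "Train $model"
done

latest=$(git rev-parse HEAD)               # hash of the latest commit
rf_hash=$(git rev-parse HEAD~1)            # hash of the "Train random_forest" commit

git tag v1.0.0                             # no hash given: tags the latest commit
git tag random_forest_best "$rf_hash"      # hash after the tag name: tags that commit

git tag                                    # list all tags
git log --oneline --decorate               # tags appear next to their commits

git checkout -q random_forest_best         # a tag works wherever a hash does
cat train.py
```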

5. Branch

After commits, branches are the bread and butter of Git. 99% of the time, you will be working inside a Git branch.

By default, the branch you’re on when you initialize Git inside a folder will be named either main or master.

Image by me. The master branch.

You can think of other branches as alternate realities of your code base.

By creating a Git branch, you can test and experiment with new features, ideas, and fixes without fearing you’ll mess up your code base.

For example, you can test a new algorithm for a classification task in a new branch without disrupting the main code base:

Image by me. Creating the new SGD branch.

Git branches are very cheap. When you call git branch new_branch_name, Git creates a new pointer to the current commit (a pseudo-copy of the branch you are on) without duplicating any of the files. Note that this only creates the branch; you still have to switch to it.

After creating a new branch and experimenting with your fresh ideas, you have the option to delete the branch if the results don’t seem promising. Alternatively, if you are happy with the changes made in the new branch, you can merge it into the master branch.
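
A rough sketch of that experiment-then-discard loop, assuming a sandbox repository and a hypothetical sgd branch (the file contents are invented). The default branch name is read from Git rather than hard-coded, since it may be main or master on your machine.

```shell
set -e
repo=$(mktemp -d)                             # throwaway sandbox repository
cd "$repo"
git init -q
git config user.name "Demo User"              # placeholder identity
git config user.email "demo@example.com"

echo "model = 'logistic_regression'" > train.py
git add train.py
git commit -q -m "Baseline model"
main_branch=$(git symbolic-ref --short HEAD)  # main or master, depending on your setup

git branch sgd                 # create the branch: a new pointer, no files copied
git switch -q sgd              # move onto it
echo "model = 'sgd'" > train.py
git commit -q -am "Try SGD classifier"

git switch -q "$main_branch"   # back on the original branch, which is untouched
cat train.py

git branch -D sgd              # results not promising: force-delete the unmerged branch
```

The capital -D is needed here because the experiment was never merged; the lowercase -d refuses to delete a branch with unmerged commits.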

6. HEAD

A Git repository can have several branches and hundreds of commits. So you might raise the excellent question: “How does Git know which branch or commit you’re at?”

The answer is a special pointer called HEAD.

Image by me. Switching of the HEAD.

Basically, the HEAD is you. Wherever you are, HEAD follows you in Git. 99% of the time, HEAD will be pointing to the latest commit in the current branch.

If you make a new commit, HEAD will move on to that. If you switch to a new or an old branch, HEAD will switch to the latest commit in that branch.

One use case for HEAD is comparing changes in different commits to each other. For example, calling git diff HEAD~1 HEAD will compare the latest commit to the commit immediately before it. (git diff HEAD~1 on its own compares your current working directory to that earlier commit, which gives the same result when you have no uncommitted changes.)

This also means that the HEAD~n syntax in Git refers to the nth commit before wherever the HEAD is.

Image by me. HEAD~n syntax.
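
The HEAD~n syntax can be tried out with a sketch like the following, which assumes a sandbox repository with three made-up commits.

```shell
set -e
repo=$(mktemp -d)                          # throwaway sandbox repository
cd "$repo"
git init -q
git config user.name "Demo User"           # placeholder identity
git config user.email "demo@example.com"

for i in 1 2 3; do
    echo "version $i" > train.py
    git add train.py
    git commit -q -m "Commit $i"
done

head_subject=$(git log -n 1 --format=%s HEAD)      # message of the commit HEAD points at
older_subject=$(git log -n 1 --format=%s HEAD~2)   # message of the commit two steps back
echo "HEAD   -> $head_subject"
echo "HEAD~2 -> $older_subject"

git --no-pager diff HEAD~1 HEAD            # what the latest commit changed
```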

You may also end up in the dreaded detached HEAD state. This doesn’t mean Git has lost track of you and doesn’t know where to point.

A detached HEAD state occurs when you use the command git checkout HASH to check out a specific commit, instead of using git checkout branch_name. This forces the HEAD to no longer point to the tip of a branch, but rather to a specific commit somewhere in the middle of the commit history.

Image by me. Detached HEAD state.

Any commits you make in the detached HEAD state will be isolated or orphaned: they won’t belong to any branch, so they drop out of your visible Git history as soon as you switch away. The reason is that HEAD is, well, the head of branches. It strongly fancies attaching itself to branch tips or heads, not their stomachs or legs.

So, if you want to make changes in a detached HEAD state, you should call git switch -c new_branch to create a new branch at the current commit. This gets you out of the detached state and attaches HEAD to the new branch.
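
Here is a minimal sketch of getting into and out of the detached state. The experiment branch and notes.txt are hypothetical names; git symbolic-ref is used only as one way to ask Git whether HEAD currently points at a branch.

```shell
set -e
repo=$(mktemp -d)                            # throwaway sandbox repository
cd "$repo"
git init -q
git config user.name "Demo User"             # placeholder identity
git config user.email "demo@example.com"

for i in 1 2 3; do
    echo "version $i" > train.py
    git add train.py
    git commit -q -m "Commit $i"
done

# Checking out a hash instead of a branch name detaches HEAD.
git checkout -q "$(git rev-parse HEAD~1)"
git symbolic-ref -q HEAD || echo "HEAD is detached"

# Create a branch at the current commit so new work has a home.
git switch -q -c experiment
echo "an idea worth keeping" > notes.txt
git add notes.txt
git commit -q -m "Save experiment"

current=$(git symbolic-ref --short HEAD)     # HEAD points at a branch again
echo "now on: $current"
```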

Getting the hang of HEAD will go a long way in helping you navigate any tangled Git tree.

7. Merge

So, what happens after you create a new branch?

Do you discard it with git branch -d branch_name if your experiment doesn’t pan out (Git will ask for a capital -D if the branch has unmerged commits)? Or do you perform a fabled Git merge?

Basically, a Git merge is a fancy party where two or even more branches come together to create a single thicker branch.

Image by me. Merging of two branches.

When you merge branches, Git takes the code from each branch and combines it into a single cohesive code base.
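
A conflict-free merge might look like the sketch below. It assumes a sandbox repository where a hypothetical sgd branch adds evaluate.py while the default branch adds clean.py, so the two lines of work never touch the same file.

```shell
set -e
repo=$(mktemp -d)                             # throwaway sandbox repository
cd "$repo"
git init -q
git config user.name "Demo User"              # placeholder identity
git config user.email "demo@example.com"

echo "print('train')" > train.py
git add train.py
git commit -q -m "Add train.py"
main_branch=$(git symbolic-ref --short HEAD)  # main or master, depending on your setup

git switch -q -c sgd                          # work on the experiment branch
echo "print('evaluate')" > evaluate.py
git add evaluate.py
git commit -q -m "Add evaluate.py"

git switch -q "$main_branch"                  # meanwhile, the default branch moves on
echo "print('clean')" > clean.py
git add clean.py
git commit -q -m "Add clean.py"

git merge -q --no-edit sgd                    # bring the sgd branch into the current branch
ls                                            # files from both branches are now present
```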

If there are overlapping changes in the branches, e.g., both branches have modified lines 5–10 in train.py, Git raises a merge conflict.

A merge conflict is as nasty as it sounds. To resolve the conflict, you have to decide which branch’s changes you want to keep.
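
To make that concrete, the following sketch provokes a conflict on purpose and then resolves it by keeping one side wholesale. The learning-rate lines and the sgd branch are invented for the demo, and taking one branch's version with git checkout --theirs is only one of several ways to resolve a conflict; real conflicts often need a manual edit of the file.

```shell
set -e
repo=$(mktemp -d)                             # throwaway sandbox repository
cd "$repo"
git init -q
git config user.name "Demo User"              # placeholder identity
git config user.email "demo@example.com"

echo "lr = 0.1" > train.py
git add train.py
git commit -q -m "Baseline"
main_branch=$(git symbolic-ref --short HEAD)  # main or master, depending on your setup

git switch -q -c sgd
echo "lr = 0.01" > train.py                   # the experiment changes the line one way
git commit -q -am "Lower learning rate"

git switch -q "$main_branch"
echo "lr = 0.5" > train.py                    # the default branch changes it another way
git commit -q -am "Raise learning rate"

# Both branches edited the same line, so the merge stops with a conflict.
git merge --no-edit sgd || echo "merge stopped: conflict"
conflicted=$(git diff --name-only --diff-filter=U)   # files with unresolved conflicts
echo "conflicted: $conflicted"
cat train.py                                  # both versions, separated by conflict markers

# Resolve by keeping the sgd branch's version, then conclude the merge.
git checkout --theirs train.py
git add train.py
git commit -q --no-edit
cat train.py
```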

Solving merge conflicts without swearing and boiling from the ears is a rare skill developed over time. So, I won’t talk much about them and will refer you to this excellent article from Atlassian.

8. Stash

I tend to screw up a lot when coding. An idea strikes me; I try it out only to realize that it’s rubbish.

At first, I would foolishly erase the mess into oblivion but later regret it. Even though the idea was rubbish, it doesn’t mean I couldn’t use certain code blocks in the future.

Then, I discovered Git stashes and they quickly became one of my favorite Git features.

When you call git stash, Git automatically stashes or hides both staged and unstaged changes in the working directory. The files revert to the state they were in right after the last commit.

Image by me. What happens in a stash?

After you stash your changes, you can continue your work as usual. When you want to retrieve them again (anywhere), you can use the git stash apply or git stash pop command. Both restore the changes previously saved in the stash to the working directory; pop also removes the entry from the stash, while apply keeps it there.

Note that the git stash command only saves changes made to tracked files, not untracked files. To stash both tracked and untracked files, you need to use the -u flag with the git stash command. Ignored files will not be included in the stash.
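
A quick sketch of a stash round trip, assuming a sandbox repository with one tracked file (train.py) and one brand-new untracked file (a hypothetical notes.txt):

```shell
set -e
repo=$(mktemp -d)                              # throwaway sandbox repository
cd "$repo"
git init -q
git config user.name "Demo User"               # placeholder identity
git config user.email "demo@example.com"

echo "print('train v1')" > train.py
git add train.py
git commit -q -m "Add train.py"

echo "print('half-baked idea')" >> train.py    # change to a tracked file
echo "scratch" > notes.txt                     # new, untracked file

git stash push -q -u                           # -u also stashes untracked files
after_stash=$(git status --porcelain)          # empty: back to the last commit
git stash list                                 # one entry: stash@{0}

git stash pop -q                               # restore the changes and drop the entry
git status --short                             # train.py modified, notes.txt untracked
```

Without the -u flag, notes.txt would simply stay in the working directory while the edit to train.py is stashed.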

9. GitHub

So, we come to the age-old question: what’s the difference between Git and GitHub?

This is like asking the difference between a burger and a cheeseburger.

Git is a version control system that tracks repositories. GitHub, on the other hand, is a web-based platform used to store Git-controlled repositories online.

Git really shines when its repositories are put online and hence opened up for collaboration. If a repository is only on your local machine, people can’t work on it with you.

So, think of GitHub as a remote mirror of your local repo that people can clone, fork, and open pull requests against.
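
The "remote mirror" idea can be tried without a GitHub account. In this sketch, a bare repository on your own disk stands in for GitHub (an assumption made purely so the example runs offline); with a real GitHub repository, the path passed to git remote add and git clone would be the repository's URL instead. The directory names are made up.

```shell
set -e
work=$(mktemp -d)                              # throwaway sandbox directory
git init -q --bare "$work/remote.git"          # stand-in for a repository hosted on GitHub

git init -q "$work/local"                      # your local repository
cd "$work/local"
git config user.name "Demo User"               # placeholder identity
git config user.email "demo@example.com"
echo "print('train')" > train.py
git add train.py
git commit -q -m "Add train.py"
branch=$(git symbolic-ref --short HEAD)        # main or master, depending on your setup

git remote add origin "$work/remote.git"       # register the remote under the name origin
git push -q -u origin "$branch"                # publish the local history to it

git clone -q "$work/remote.git" "$work/colleague"   # what a collaborator would do
ls "$work/colleague"                           # the collaborator now has train.py
```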

And if these terms sound alien to you, stick around for my next article, where I explain N (I don’t know how many right now) GitHub concepts that will clear up the confusion right away.
