Big data and hadoop ecosystem

Big data

Just data
Structured data: data that has a defined length and format for each record. It’s stored in a fixed format such as a relational database or spreadsheet. It’s easy to search and analyze. It’s used for transactional data.
Unstructured data: data that has an unknown length and format. It’s stored in a free format such as a text file. It’s difficult to search and analyze. It’s used for non-transactional data.
Semi-structured data: data that carries its own organizing markers (tags or keys) for each record but doesn’t conform to the structure of a relational database. It’s stored in a semi-structured format such as XML or JSON. It’s easier to search and analyze than unstructured data. It’s used for non-transactional data.

Types of data analysis
descriptive: what happened?
diagnostic: why did it happen?
predictive: what will happen?
prescriptive: how can we make it happen?

Data management software

Hadoop
Hadoop is a framework for distributed storage and processing of large data sets using the MapReduce programming model. It consists of a distributed file system (HDFS) and a distributed processing framework (MapReduce). It’s written in Java and is open source. It’s designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failure. Its use cases include data lakes, data warehouses, data hubs, data science, and data engineering. It’s used by Facebook, Yahoo, LinkedIn, eBay, and Twitter. Its core components are HDFS, YARN, and MapReduce. ...
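
To make the MapReduce programming model a bit more concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop's API and nothing here is distributed; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample sentences are my own assumptions, chosen only to show the map, shuffle, and reduce steps on a word count.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle step: group all emitted values by their key."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: dict[str, list[int]]) -> dict[str, int]:
    """Reduce step: collapse each key's list of values into one count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    # Hypothetical sample input, used only for illustration.
    sample = ["big data is just data", "hadoop stores big data"]
    print(word_count(sample))
```

In a real cluster the map and reduce steps would run on many machines and the shuffle would move data between them; this sketch only mirrors the shape of the computation.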

December 5, 2023 · 20 min · 4093 words · Aum Pauskar

Image analysis with pytorch

Image Analysis using PyTorch

Prerequisites
This project is built using Python on Ubuntu (WSL), and you’ll need to install the following:

Any bash terminal (one of the following): Conda, WSL, macOS, any flavour of Linux

Python 3 (I’m using 3.10.12)
    sudo apt update
    sudo apt install python3

Pip
    sudo apt update
    sudo apt install python3-pip

Packages
Note: Since I’m using a computer with a CUDA-compatible NVIDIA GPU, I’ll be using the GPU version of PyTorch. If you don’t have a GPU, you can install the CPU version of PyTorch given below.

CPU install
    pip3 install torch torchvision numpy matplotlib

GPU install
Installing numpy and matplotlib
    pip3 install numpy matplotlib
Installing PyTorch: check the PyTorch website to see which library is compatible with your system. In my case I’m using CUDA 11.8.
    pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Jupyter notebook (optional)
    pip3 install jupyterlab
Or just use Jupyter notebook from VS Code from here

Environment I’ve used
    CPU     Ryzen 7 5800H
    GPU     RTX 3060 Laptop
    RAM     2x8GB DDR4 @ 3200MHz
    OS      Windows 11/Ubuntu 22.04 WSL
    CUDA    11.8
    Python  3.10.12

MNIST number dataset
The MNIST dataset is a dataset of handwritten digits. It has 60,000 training images and 10,000 test images. We’ll look at code that loads the dataset and displays the occurrences of individual digits in it. If you want to run the code from a Jupyter notebook, you can clone my repository via git. ...
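
As a rough idea of what counting the occurrences of each digit can look like, here is a small sketch in plain Python. It is not the code from the post: digit_counts is a hypothetical helper, and the label list is a made-up stand-in so the snippet runs without downloading anything. The commented lines show how the real labels could plausibly be obtained, assuming torchvision is installed.

```python
from collections import Counter
from typing import Iterable


def digit_counts(labels: Iterable[int]) -> dict[int, int]:
    """Return how many times each digit 0-9 appears in `labels`."""
    counts = Counter(labels)
    # Report every digit, including those that never occur.
    return {digit: counts.get(digit, 0) for digit in range(10)}


if __name__ == "__main__":
    # Hypothetical stand-in labels. With torchvision installed, the real
    # training labels could be loaded with something like:
    #   from torchvision import datasets
    #   labels = datasets.MNIST(root="data", train=True, download=True).targets.tolist()
    labels = [5, 0, 4, 1, 9, 2, 1, 3, 1, 4]
    for digit, count in digit_counts(labels).items():
        print(f"{digit}: {count}")
```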

December 3, 2023 · 5 min · 1062 words · Aum Pauskar

Extended markdown cheatsheet with KaTeX

Extended Markdown Cheatsheet

Heading
    # Heading 1
    ## Heading 2
    ### Heading 3
    #### Heading 4
    ##### Heading 5
    ###### Heading 6

Paragraph
    This is a paragraph.

Code snippet
    'print("Hello World")'

Block of code
    '''
    print("Hello World")
    '''

Code within the block of code can be selectively highlighted
    '''python
    print("Hello World")
    '''

Emphasis
    *This text will be italic*
    _This will also be italic_
    **This text will be bold**
    __This will also be bold__
    _You **can** combine them_

Table
    | Syntax      | Description |
    | ----------- | ----------- |
    | Header      | Title       |
    | Paragraph   | Text        |

Bullet points
    - Bulletpoint 1
    - Bulletpoint 2
      - Bulletpoint 2.1
      - Bulletpoint 2.2
or ...

November 25, 2023 · 2 min · 287 words · Aum Pauskar

Git and GitHub cheatsheet

Git / GitHub cheatsheet

Installation of git
Download git from here or, on Debian-based Linux, run
    sudo apt update
    sudo apt install git
Check if git is installed
    git --version
Configure git
    git config --global user.name "Your Name"
    git config --global user.email "Your email"
(Optional) Change the default branch name from master to main
    git config --global init.defaultBranch main

Configuring SSH keys for GitHub
Run these commands in the terminal
    ssh-keygen -t ed25519 -C "Your email"
    eval "$(ssh-agent -s)"
    ssh-add ~/.ssh/id_ed25519
    cat ~/.ssh/id_ed25519.pub
Go to GitHub > Settings > SSH and GPG keys. Click on New SSH key. Give it a title and paste the key into the key field; the key is the output of the last command above. Click on Add SSH key.

Git commands
git init - initialize a git repository
git add <file> - add a file to the staging area
git add . - add all changes in the current directory and below to the staging area
git add -A - add all changes in the whole working tree to the staging area
git commit -m "message" - commit staged changes to the local repository
git push - push changes to the remote repository
git push -u origin <branch_name> - push a branch and set it as the upstream for future pushes and pulls
git push origin <branch_name> - push changes to a branch
git pull - pull changes from the remote repository
git pull origin <branch_name> - pull changes from a branch
git status - check the status of the repository
git log - view the commit history
git branch - view the branches
git branch <branch_name> - create a new branch
git checkout <branch_name> - switch to a branch
git checkout -b <branch_name> - create and switch to a branch
git merge <branch_name> - merge a branch into the current branch
git clone <url> - clone a remote repository
git remote add origin <url> - add a remote repository named origin
git remote -v - view the remote repositories
git remote set-url origin <url> - change the url of the remote repository
git remote remove origin - remove the remote named origin

git submodule
git submodule adds a git repository inside another git repository. For example, you can use it to add a repository that contains a library to your project. This way, you can use the library in your project without having to copy its files into your project directory.
git submodule add <url> - add a submodule
git submodule init - initialize the submodule
git submodule update - check out the submodule commit recorded in the parent repository
git submodule update --remote - update the submodule to the latest commit on its remote branch

November 25, 2023 · 3 min · 447 words · Aum Pauskar

OOPS with Python and packages

Python

Chapter summary - Unit 1
Unit 1 Python Fundamentals: An Introduction to Python programming: Introduction to Python, IDLE to develop programs; How to write your first programs: Basic coding skills, data types and variables, numeric data, string data, five of the Python functions; Control statements: Boolean expressions, selection structure, iteration structure; Define and use Functions and Modules: define and use functions, more skills for defining and using functions and modules, create and use modules, standard modules

Contents
What is Python? Python is a high-level interpreted language that is preferred for rapid program development because of its easy and simple syntax.
IDLE: IDLE is short for Integrated Development and Learning Environment.
Data types in Python: int (5), string ('this is a string' or "this is a string"), tuple ((5, 3, 5)), float (5.3), bool (True/False) …
Python iteration structure: Unlike many other languages, Python uses indentation instead of braces to mark blocks of code.
Comments: Comments are text in the source that the interpreter ignores, essentially “dead code”. They exist to show the programmer what the code does, not to do anything themselves. Comments in Python - ...
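
The points about data types, indentation, and comments can be seen together in a short sketch. The helper names describe and count_even and the sample values are my own, picked only for illustration; they are not from the chapter.

```python
def describe(value: object) -> str:
    """Return the name of a value's built-in type, e.g. 'int' or 'str'."""
    return type(value).__name__


def count_even(numbers: tuple[int, ...]) -> int:
    """Count the even numbers; indentation alone marks each block."""
    total = 0
    for number in numbers:      # iteration structure
        if number % 2 == 0:     # selection structure with a Boolean expression
            total += 1
    return total


if __name__ == "__main__":
    # This line is a comment: the interpreter skips it entirely.
    samples = [5, "this is a string", (5, 3, 5), 5.3, True]
    for sample in samples:
        print(describe(sample), sample)
    print(count_even((5, 3, 6, 8)))
```

Removing the indentation under the for or if line would raise an IndentationError, which is the practical consequence of Python using indentation rather than braces.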

November 23, 2023 · 25 min · 5176 words · Aum Pauskar