Apache

Big data Just data Structured data: data that has a defined length and format for each record. It’s stored in a fixed format such as a relational database or spreadsheet. It’s easy to search and analyze. It’s used for transactional data. Unstructured data: data that has an unknown length and format. It’s stored in a free format such as a text file. It’s difficult to search and analyze. It’s used for non-transactional data. Semi-structured data: data that has a defined length and format for each record but doesn’t conform to the structure of a relational database. It’s stored in a semi-structured format such as XML or JSON. It’s easy to search and analyze. It’s used for non-transactional data. Types of data analysis descriptive: what happened? diagnostic: why did it happen? predictive: what will happen? prescriptive: how can we make it happen? Data management software Hadoop Hadoop is a framework for distributed storage and processing of large data sets using the MapReduce programming model. It consists of a distributed file system (HDFS) and a distributed processing framework (MapReduce). It’s written in Java and is open source. It’s designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures. Its use cases include data lake, data warehouse, data hub, data science, and data engineering. It’s used by Facebook, Yahoo, LinkedIn, eBay, and Twitter. It’s core components are HDFS, YARN, and MapReduce. ...