diff --git a/Cloud/readme.md b/Cloud/readme.md new file mode 100644 index 0000000..a94a9d9 --- /dev/null +++ b/Cloud/readme.md @@ -0,0 +1,218 @@ +- ISO and uses + - ISO 9001: + - quality management system standard + - ensure they meet customer and regulatory requirements. + - It focuses on continuous improvement and customer satisfaction. + - ISO 27001: + - information security management system standard that helps organizations manage and + - protect their information assets. + - managing sensitive company information, + - ensuring its confidentiality, integrity, and availability. + - ISO 14001: + - environmental management system standard + - minimize their environmental impact. + - manage environmental responsibilities in a systematic manner, + - contributing to sustainability. + - ISO 45001: + - occupational health and safety management system standard + - create a safe and healthy work environment. + - managing occupational health and safety risks and improving overall workplace safety. + - ISO 50001: + - energy management system standard + - improve their energy performance. + - manage energy use and reduce their energy consumption, + - leading to cost savings and environmental benefits. +- Cloud computing and uses + - Cloud computing + - delivery of computing services over the internet, + - allowing users to access and store data and applications on remote servers + - rather than on local devices. + - It offers + - scalability, flexibility, and cost-efficiency for businesses and individuals. + - uses + - data storage, + - software development, and + - hosting applications. + - access resources on-demand, + - collaborate remotely, + - scale their operations without the need for significant upfront investments in infrastructure. + - supports the growth of technologies such as + - artificial intelligence, + - big data analytics, + - Internet of Things (IoT). 
+- Firewall and uses + - Firewall + - network security device that monitors and controls incoming and outgoing network traffic based on predetermined security rules. + - It acts as a barrier between a trusted internal network and untrusted external networks, such as the internet. + - uses + - protect against unauthorized access, + - prevent cyberattacks, + - filter out malicious traffic, + - enforce security policies, + - monitor network activity for potential threats. + - port forwarding + - technique used to allow external devices to access services on a private network. + - It involves forwarding incoming traffic from a specific port on the router to a designated device or service within the internal network. + - uses + - hosting servers, + - remote desktop access, + - online gaming, + - accessing security cameras, + - enabling communication between devices across different networks. + - VPN (Virtual Private Network) + - secure and private network connection that allows users to access the internet or a private network securely. + - It creates an encrypted tunnel between the user's device and the VPN server, + - masking the user's IP address and providing anonymity online. + - uses + - protect sensitive data, + - bypass geo-restrictions, + - access blocked websites, + - enhance online privacy, + - secure remote work connections. + - Dynamic Content Filtering + - technique used to analyze and filter web content in real-time based on predefined rules and policies. + - It allows organizations to control and restrict access to certain types of content, such as websites, applications, or specific keywords. + - uses + - enhance security by blocking malicious or inappropriate content, + - enforce acceptable use policies, + - improve productivity by restricting access to non-work-related sites, + - protect against data leaks by filtering sensitive information from being transmitted over the network. 
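Dynamic content filtering of the kind described above is, at its core, a small rule engine matched against each request. A minimal sketch in Python — the domains, keywords, and policy rules here are hypothetical, not from any real product:

```python
# Minimal sketch of a rule-based content filter (hypothetical rules,
# not a production filtering engine).
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"malware.example.com", "games.example.net"}   # assumed policy
BLOCKED_KEYWORDS = {"gambling", "phishing"}                      # assumed policy

def is_blocked(url: str) -> bool:
    """Return True if the URL matches a blocked-domain or keyword rule."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if host in BLOCKED_DOMAINS:
        return True
    # Keyword rules are checked against the URL path and query string
    text = (parsed.path + "?" + (parsed.query or "")).lower()
    return any(word in text for word in BLOCKED_KEYWORDS)
```

A real filter would additionally inspect response bodies, use category databases, and handle TLS interception, but the allow/deny decision per request follows this shape.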
+- Patch Management Systems + - Types of patch management systems + - manual patch management + - Manual patch management involves the process of manually identifying, downloading, and installing patches for software applications and operating systems. + - This approach requires IT administrators to regularly check for updates, assess their relevance and compatibility, and then apply them to the systems. + - It can be time-consuming and prone to human error, but it allows for greater control over the patching process and can be suitable for smaller organizations with limited resources. + - automated patch management + - Automated patch management refers to the use of software tools and systems to automatically identify, download, and install patches for software applications and operating systems. + - This approach streamlines the patching process by reducing the need for manual intervention, allowing IT administrators to schedule and manage patches more efficiently. + - Automated patch management can help ensure that systems are kept up-to-date with the latest security patches and updates, reducing the risk of vulnerabilities and improving overall system security. + - cloud-based patch management + - Cloud-based patch management is a method of managing and deploying software patches and updates using cloud computing technology. + - It allows organizations to centrally manage and distribute patches to their systems and applications from a cloud-based platform. + - This approach offers benefits such as scalability, flexibility, and remote accessibility, making it easier for organizations to keep their systems up-to-date and secure without the need for on-premises infrastructure. + - integrated patch management + - Integrated patch management refers to the seamless integration of patch management processes and tools into an organization's overall IT infrastructure and operations.
+ - This approach involves incorporating patch management into existing IT workflows, such as change management and vulnerability management, to ensure that patches are applied in a timely and efficient manner. + - Integrated patch management can help organizations maintain a more secure and up-to-date IT environment by streamlining the patching process and reducing the risk of vulnerabilities. + - scheduled patch management + - Scheduled patch management is a method of managing and deploying software patches and updates according to a predetermined schedule. + - This approach allows organizations to plan and coordinate patching activities in advance, ensuring that patches are applied consistently and at appropriate times to minimize disruption to users and systems. + - Scheduled patch management can help organizations maintain a regular patching cadence, reducing the risk of vulnerabilities and improving overall system security. + - on-demand patch management + - On-demand patch management refers to the process of applying software patches and updates as needed, rather than following a predetermined schedule. + - This approach allows organizations to respond quickly to emerging threats and vulnerabilities by applying patches as soon as they become available. + - On-demand patch management can help organizations stay ahead of potential security risks and ensure that their systems are protected against the latest threats, but it may require more resources and attention to ensure that patches are applied in a timely manner. + - proactive patch management + - Proactive patch management is a strategic approach to managing software patches and updates that focuses on anticipating and addressing potential vulnerabilities before they can be exploited. + - This approach involves regularly monitoring for new vulnerabilities, assessing their potential impact, and applying patches proactively to mitigate risks. 
+ - Proactive patch management can help organizations stay ahead of emerging threats and reduce the likelihood of security breaches by ensuring that systems are consistently updated and protected against known vulnerabilities. + - Reactive patch management + - Reactive patch management refers to the process of applying software patches and updates in response to identified vulnerabilities or security incidents. + - This approach involves monitoring for security threats and applying patches as needed to address vulnerabilities that have been exploited or are at risk of being exploited. + - Reactive patch management can help organizations respond quickly to emerging threats and mitigate risks, but it may also leave systems vulnerable if patches are not applied in a timely manner or if new vulnerabilities are not identified and addressed promptly. +- Types of cloud + - Virtual Private Cloud (VPC) + - A virtual private cloud (VPC) is a private, isolated section of a public cloud infrastructure that allows organizations to create and manage their own virtual network. + - It provides a secure and scalable environment for hosting applications and services while maintaining control over network configurations and security settings. + - VPCs are commonly used for hosting web applications, databases, and other workloads that require a high level of security and customization. +- Network Time Protocol (NTP) + - Network Time Protocol (NTP) is a protocol used to synchronize the clocks of computer systems over a network. + - It allows devices to maintain accurate time by communicating with a reference time source, such as an atomic clock or a GPS receiver. + - NTP is essential for ensuring that timestamps on logs, transactions, and other time-sensitive data are consistent across all systems in a network, which is crucial for troubleshooting, security, and overall system performance. 
+ - NTP works by using a hierarchical system of time sources, where devices can synchronize with higher-level time servers that are more accurate and reliable. + - NTP can be used in various applications, including: + - Synchronizing time across servers and devices in a data center or cloud environment. + - Ensuring accurate timestamps for logging and auditing purposes. + - Coordinating time-sensitive operations, such as scheduled tasks and automated processes. + - Improving security by ensuring that time-based authentication mechanisms, such as Kerberos, function correctly. + - Maintaining accurate time for financial transactions and other critical applications that require precise timing. + - Enabling accurate timekeeping for distributed systems and applications that rely on synchronized clocks for proper functioning. + - Facilitating time-based analytics and monitoring by ensuring that data from different sources is accurately timestamped and can be correlated effectively. + - Supporting time-sensitive applications, such as online gaming, video streaming, and real-time communication, where accurate time synchronization is essential for optimal performance and user experience. + - Ensuring compliance with regulatory requirements that mandate accurate timekeeping for certain industries, such as finance and healthcare. + - Providing a reliable time source for devices and applications that do not have access to an external time source, such as those in isolated or secure environments. + - Enabling accurate timekeeping for scientific research and experiments that require precise timing for data collection and analysis. + - Improving overall system performance and reliability by ensuring that all devices and applications are synchronized to a common time source, reducing the likelihood of errors and inconsistencies caused by time discrepancies. 
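NTP synchronization as described above can be illustrated with a few lines of Python. An SNTP-style client sends a 48-byte request (version 3, mode 3) and decodes the server's 64-bit transmit timestamp, which counts seconds since 1900 rather than the Unix epoch of 1970. This is a simplified sketch with no round-trip delay compensation or error handling:

```python
# Simplified SNTP-style sketch: build a client request and decode the
# transmit timestamp from a 48-byte server reply (offline parsing shown).
import struct

NTP_EPOCH_OFFSET = 2_208_988_800  # seconds between 1900-01-01 and 1970-01-01

def build_client_packet() -> bytes:
    # First byte: LI=0, version=3, mode=3 (client) -> 0b00011011 = 0x1B;
    # the remaining 47 bytes may be zero for a simple client request.
    return b"\x1b" + 47 * b"\0"

def transmit_time_unix(reply: bytes) -> float:
    # The Transmit Timestamp occupies bytes 40-47 of the packet:
    # a 32-bit seconds field followed by a 32-bit fraction field.
    seconds, fraction = struct.unpack("!II", reply[40:48])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32
```

On a real network the request would be sent to a time server over UDP port 123 (e.g., with `socket.sendto`) and the reply decoded with `transmit_time_unix`.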
+- Precision Time Protocol (PTP) + - Precision Time Protocol (PTP, standardized as IEEE 1588) is a protocol used to synchronize the clocks of computer systems within a local area network (LAN), often to sub-microsecond accuracy. + - It is designed to provide highly accurate time synchronization for devices that are connected to the same network, allowing them to maintain consistent time across all systems. + - PTP works by using a master-slave architecture, where one device acts as the grandmaster clock and other devices synchronize their clocks to it. + - This protocol is commonly used in environments where highly accurate timekeeping is essential, such as in financial institutions, telecommunications networks, and data centers. + - PTP can help ensure that timestamps on logs, transactions, and other time-sensitive data are consistent across all systems in a network, which is crucial for troubleshooting, security, and overall system performance. +- Differential Backup + - Differential backup is a type of backup strategy that involves copying only the data that has changed since the last full backup. + - Because each differential backup contains all changes made since the last full backup, a restore needs only the full backup plus the most recent differential, making restores faster than with incremental backups; each individual differential, however, grows larger the more time has passed since the last full backup. + - Differential backups are typically used in conjunction with full backups, where a full backup is performed periodically (e.g., weekly) and differential backups are performed more frequently (e.g., daily) to capture changes in between full backups. + - This approach can help reduce storage requirements and backup times compared to taking a full backup every day, while still providing a reliable way to restore data in case of loss or corruption. +- Types of firewalls + - Packet-filtering firewall + - A packet-filtering firewall is a type of network security device that monitors and controls incoming and outgoing network traffic based on predetermined security rules.
+ - It operates at the network layer of the OSI model and examines each packet of data to determine whether it should be allowed or blocked based on criteria such as source and destination IP addresses, port numbers, and protocol types. + - Packet-filtering firewalls are commonly used to protect networks from unauthorized access and to enforce security policies by filtering out potentially harmful traffic. + - Stateful inspection firewall + - A stateful inspection firewall, also known as a dynamic packet-filtering firewall, is a type of network security device that monitors and controls incoming and outgoing network traffic based on the state of active connections. + - It operates at the network and transport layers of the OSI model and keeps track of the state of each connection, allowing it to make more informed decisions about whether to allow or block traffic. + - Stateful inspection firewalls are more advanced than packet-filtering firewalls and can provide better security by allowing or blocking traffic based on the context of the connection, rather than just individual packets. + - Proxy firewall + - A proxy firewall, also known as an application-level gateway, is a type of network security device that acts as an intermediary between a user's device and the internet. + - It operates at the application layer of the OSI model and intercepts all incoming and outgoing traffic, allowing it to analyze and filter traffic based on specific application-level protocols, such as HTTP, FTP, or SMTP. + - Proxy firewalls can provide enhanced security by inspecting the content of the traffic and blocking potentially harmful requests, as well as by hiding the internal network's IP addresses from external sources. 
+ - Next-generation firewall (NGFW) + - A next-generation firewall (NGFW) is an advanced type of network security device that combines traditional firewall capabilities with additional features such as intrusion prevention, application awareness, and deep packet inspection. + - NGFWs operate at multiple layers of the OSI model and can provide more comprehensive security by analyzing traffic in greater detail and allowing or blocking traffic based on a wider range of criteria, including application-level protocols, user identity, and behavior patterns. + - NGFWs are designed to address the evolving threat landscape and provide better protection against modern cyber threats, such as advanced malware, ransomware, and targeted attacks. + - Unified Threat Management (UTM) firewall + - A unified threat management (UTM) firewall is a type of network security device that combines multiple security features and functions into a single appliance. + - UTM firewalls typically include features such as firewall protection, intrusion detection and prevention, antivirus and anti-malware capabilities, content filtering, and virtual private network (VPN) support. + - UTM firewalls are designed to provide comprehensive security for small to medium-sized businesses by offering a simplified and cost-effective solution for managing multiple security functions in one device. + - They can help organizations protect against a wide range of threats while reducing the complexity and overhead associated with managing multiple security appliances. +- Software Defined Storage (SDS) + - Software-defined storage (SDS) is a storage architecture that separates the storage hardware from the software that manages it. + - SDS allows organizations to manage and provision storage resources using software, rather than relying on specific hardware configurations. 
+ - This approach provides greater flexibility, scalability, and cost-efficiency, as it enables organizations to use commodity hardware and easily scale storage capacity as needed. + - SDS can be used in various environments, including data centers, cloud computing, and virtualized environments, to provide efficient and agile storage solutions that meet the demands of modern applications and workloads. +- RAID + - RAID (Redundant Array of Independent Disks) + - data storage technology that combines multiple physical disk drives into a single logical unit + - for improved performance, redundancy, or both. + - RAID levels include: + - RAID 0: + - Striped set without parity, providing improved performance but no redundancy. + - RAID 1: + - Mirrored set, providing redundancy by duplicating data across two or more disks. + - RAID 5: + - Striped set with distributed parity, + - providing a balance of performance and redundancy by distributing parity information + - across all disks. + - RAID 6: + - Striped set with double distributed parity, + - providing enhanced redundancy by allowing for the failure of two disks without data loss. + - RAID 10: + - A combination of RAID 1 and RAID 0, + - providing both improved performance and redundancy by mirroring and + - striping data across multiple disks. + - RAID is commonly used in data centers, servers, and storage systems to improve performance and ensure data availability in case of disk failures. It can help organizations protect their data and maintain business continuity by providing redundancy and fault tolerance, while also improving read and write performance for applications that require high-speed access to data. +- Hypervisor + - A hypervisor, + - also known as a virtual machine monitor (VMM), + - is a software layer that allows multiple virtual machines (VMs) + - to run on a single physical host. 
+ - It abstracts the underlying hardware and provides each VM with its own virtualized environment, + - including CPU, memory, storage, and network resources. + - Hypervisors can be classified into two main types: + - Type 1 (bare-metal) hypervisors, which run directly on the host's hardware, and + - Type 2 (hosted) hypervisors, which run on top of an existing operating system. + - Hypervisor Functions: + - Resource Allocation: Hypervisors manage and allocate physical resources such as CPU, memory, and storage to each virtual machine, ensuring that they have the necessary resources to operate effectively. + - Isolation: Hypervisors provide isolation between virtual machines, preventing them from interfering with each other and ensuring that any issues or failures in one VM do not affect others. + - Virtualization: Hypervisors enable the creation of virtual machines that can run different operating systems and applications on the same physical hardware, allowing for greater flexibility and efficiency in resource utilization. + - Management: Hypervisors offer management tools and interfaces that allow administrators to create, configure, monitor, and manage virtual machines, making it easier to deploy and maintain virtualized environments. + - Security: Hypervisors can provide security features such as access controls, encryption, and network segmentation to protect virtual machines and the underlying host from unauthorized access and potential threats. + - Memory ballooning: + - Hypervisors can dynamically adjust the amount of memory allocated to virtual machines + - based on their current needs, allowing for more efficient use of memory resources. + - Live migration: Hypervisors can enable the live migration of virtual machines from one physical host to another without downtime, allowing for load balancing, maintenance, and disaster recovery. 
+ - Snapshots: Hypervisors can create snapshots of virtual machines, allowing administrators to capture the state of a VM at a specific point in time and revert back to it if needed, which is useful for testing and troubleshooting. + - Virtual networking: Hypervisors can create virtual networks that allow virtual machines to communicate with each other and with the outside world, providing network connectivity and isolation as needed. + - Storage virtualization: Hypervisors can abstract and manage storage resources, allowing virtual machines to access and utilize storage in a flexible and efficient manner, such as through virtual disks or shared storage pools. diff --git a/Data Management/big_data.png b/Data Management/big_data.png new file mode 100644 index 0000000..e51161a Binary files /dev/null and b/Data Management/big_data.png differ diff --git a/Data Management/data_architecture.md b/Data Management/data_architecture.md new file mode 100644 index 0000000..5a04418 --- /dev/null +++ b/Data Management/data_architecture.md @@ -0,0 +1,48 @@ +- Data Architecture + - Data architecture is the design and structure of an organization's data assets, including how data is collected, stored, processed, and accessed. It encompasses the overall framework and guidelines for managing data within an organization. + - Key components of data architecture include: + - Data models: These define the structure and relationships of data entities, such as tables in a database or objects in a data warehouse. + - Data storage: This refers to the physical or cloud-based infrastructure used to store data, such as databases, data lakes, or data warehouses. + - Data integration: This involves the processes and tools used to combine data from different sources into a unified view, such as ETL (Extract, Transform, Load) processes or data pipelines. 
+ - Data governance: This encompasses the policies and procedures for managing data quality, security, privacy, and compliance within an organization. + - A well-designed data architecture can help organizations efficiently manage their data assets, improve data quality and accessibility, and support better decision-making. + - Benefits of a good data architecture include: + - Improved data quality and consistency: A well-defined data architecture can help ensure that data is accurate, complete, and consistent across the organization. + - Enhanced data accessibility: A clear data architecture can make it easier for users to find and access the data they need for analysis and decision-making. + - Increased agility: A flexible data architecture can allow organizations to quickly adapt to changing business needs and incorporate new data sources or technologies. + - Better decision-making: With a well-structured data architecture, organizations can leverage their data assets more effectively to gain insights and make informed decisions. + - Challenges in data architecture include: + - Complexity: Designing and maintaining a data architecture can be complex, especially for large organizations with diverse data sources and requirements. + - Data silos: Different departments or teams may have their own data systems and processes, leading to fragmented data and difficulties in integration. + - Data quality issues: Poor data quality can undermine the effectiveness of a data architecture and lead to inaccurate insights and decisions. + - Security and privacy concerns: Ensuring the security and privacy of data is a critical aspect of data architecture, especially with increasing regulations and cyber threats. + - To address these challenges, organizations can adopt best practices such as: + - Establishing clear data governance policies and procedures to ensure data quality, security, and compliance. 
+ - Implementing data integration tools and processes to break down data silos and create a unified view of data across the organization. + - Investing in training and resources to build data literacy and skills among employees, enabling them to effectively use and manage data. + - Regularly reviewing and updating the data architecture to ensure it remains aligned with business needs and technological advancements. +- Asset Management software: + - SolarWinds Network Configuration Manager: manages and monitors network devices such as routers, switches, and firewalls, with configuration management, compliance auditing, and network performance monitoring. + - Microsoft System Center Configuration Manager: manages IT hardware and software, with inventory management, software deployment, and patch management. + - IT asset management tools (hardware, software, and cloud resources; typical features: asset tracking, inventory management, and lifecycle management): ManageEngine AssetExplorer, ServiceNow Asset Management, BMC Helix Asset Management, Asset Panda, Freshservice Asset Management, Ivanti Asset Manager, Lansweeper, SysAid Asset Management, Spiceworks Asset Management + - Physical and enterprise asset management tools (equipment, facilities, and infrastructure; typical features: asset tracking, maintenance management, and work order management): IBM Maximo Asset Management, SAP Asset Manager, Oracle Enterprise Asset Management, Infor EAM, Gensuite Asset Management, UpKeep Asset Management, Fiix Asset Management, eMaint Asset Management, MPulse Asset Management, AssetCloud, AssetWorks, Asset Infinity
+ - AssetTiger: This software is designed for managing physical assets, such as equipment, facilities, and infrastructure. It provides features such as asset tracking, maintenance management, and work order management. diff --git a/Data Management/readme.md b/Data Management/readme.md new file mode 100644 index 0000000..9514834 --- /dev/null +++ b/Data Management/readme.md @@ -0,0 +1,1637 @@ +- Big Data: + - [big_data](big_data.png) + - large volume of structured and unstructured data + - generated by organizations and individuals on a daily basis. + - Characterized by the 3Vs: + - Volume, + - Velocity, and + - Variety. + - analyzed for insights + - lead to better decisions and + - strategic business moves. + - Big Data + - Hadoop + - Hadoop Distributed File System (HDFS) + - distributed file system + - allows for the storage of large data sets across multiple machines. + - provides high throughput access to data and + - is designed to be fault-tolerant. + - MapReduce algorithm + - programming model and an associated implementation for processing and generating large data sets + - with a parallel, distributed algorithm on a cluster. + - Spark + - Hive + - Pig + - HBase + - Flume + - Sqoop + - Kafka + - Zookeeper + - Oozie + - Mahout + - Cassandra + - MongoDB + - Redis + - Neo4j + - GraphX + - Presto + - Impala + - Drill + - Airflow + - Data Lakes + - Data Warehouses + - e.g. 
+ - Hadoop Datalake on parquet files in AWS environment + - Amazon Redshift, + - Google BigQuery, + - Snowflake, + - Microsoft Azure Synapse Analytics, + - Teradata, Oracle Exadata, + - IBM Db2 Warehouse, + - SAP HANA, + - Vertica, + - Greenplum, + - Netezza, + - Cloudera Data Warehouse, + - Apache Hive, + - Apache Impala, + - Apache Drill + - ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) + - get data from one data store to another, + - typically involves three main steps: + - Extract: + - The process of retrieving data from various source systems, + - Transform: + - The process of cleaning, + - enriching, and + - transforming the data into a format suitable for analysis and reporting, + - common transformations include: + - union, + - join, + - filter, + - sort, + - aggregate, + - pivot, + - unpivot, + - regex, and + - more complex transformations such as: + - Data Cleaning + - Data Enrichment + - Data Aggregation + - Data Filtering + - Data Sorting + - Data Joining + - Data Splitting + - Data Validation + - prevent errors at source + - ensure accuracy and consistency of data + - manual or automated rules + - Data Standardization + - Data Normalization + - Data Denormalization + - Data Masking + - Data Anonymization + - Data Encryption + - Load: The process of loading the transformed data into a target data store, such as a data warehouse or data lake. 
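The Extract, Transform, Load steps above can be sketched end-to-end in a few lines. This example uses made-up sample data: it extracts rows from CSV text, applies simple transformations (trimming, standardization, and a validation filter), and loads the result into an in-memory SQLite table standing in for a warehouse:

```python
# Minimal ETL sketch: CSV source -> cleaned rows -> SQLite "warehouse".
# The table, columns, and sample data are illustrative, not from any real system.
import csv
import io
import sqlite3

SOURCE_CSV = """id,name,amount
1, Alice ,100
2,Bob,-5
3,carol,250
"""

def extract(text):
    """Extract: read raw records from the CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim whitespace, standardize names, filter invalid amounts."""
    out = []
    for r in rows:
        amount = int(r["amount"])
        if amount < 0:          # validation rule: drop negative amounts at source
            continue
        out.append((int(r["id"]), r["name"].strip().title(), amount))
    return out

def load(rows):
    """Load: write transformed rows into the target table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (id INTEGER, name TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    return conn

conn = load(transform(extract(SOURCE_CSV)))
```

In an ELT variant the raw rows would be loaded first and the same transformations expressed as SQL inside the target system.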
+ - ETL tools include: + - Informatica PowerCenter + - Syncsort DMX and DMX-h + - h = Hadoop + - Microsoft SQL Server Integration Services (SSIS) + - Apache NiFi + - Open Source ETL Tools + - Talend Open Studio + - Pentaho Data Integration (PDI) + - JasperETL + - CloverETL + - Kettle + - Apache Camel + - Python or Java pipelines using libraries such as: + - Pandas + - NumPy + - PySpark + - Apache Beam + - Apache Airflow + - IBM InfoSphere DataStage + - Oracle Data Integrator (ODI) + - AWS Glue + - Google Cloud Dataflow + - Apache Spark + - Apache Flink + - Apache Kafka Connect + - Apache Sqoop + - Apache Oozie + - Apache Pig + - Apache Hive + - Apache Impala + - Apache Drill + - Apache HBase + - Apache Cassandra + - MongoDB + - Redis + - Neo4j + - Spark GraphX + - Presto + - connect to source databases using: + - JDBC (Java Database Connectivity) + - ODBC (Open Database Connectivity) + - REST API (Application Programming Interface) + - File-based connectors (e.g., CSV, JSON, XML) + - Cloud-based connectors (e.g., AWS S3, Google Cloud Storage, Azure Blob Storage) + - Custom connectors developed using programming languages such as Java, Python, or Scala. + - OLTP (Online Transaction Processing) + - OLAP (Online Analytical Processing) + - Data Science + - Machine Learning + - Deep Learning + - Natural Language Processing (NLP) + - Computer Vision + - Big Data Analytics + - Data Visualization + - Data Governance + - Process of managing the + - availability, + - usability, + - integrity, and + - security of the data used in an organization. + - Add data governance checkpoints in the SDLC, especially at: + - data transformation, + - reporting + - Examples of data governance practices include: + - Corporate Glossary + - Terminology management system that + - defines and standardizes the business terms used in an organization. 
+ - Data Catalog + - inventory of data assets within an organization + - supports searching and tagging of data assets, + - governed with access controls and data lineage tracking. + - supports data classification and data quality management. + - sensitive data + - public data + - Data Lineage + - tracking and documenting the flow of data from its origin to its destination, + - including all transformations, processes, and + - systems that the data passes through along the way. + - Detect PII Data + - Sensitive PII + - personally identifiable information + - e.g. + - Full Name + - SSN - Social Security Number + - Driving License + - Financial Information + - Credit Cards + - Passport information + - Medical Records + - Email + - Phone number + - Biometric information + - Non-sensitive PII + - Gender + - Date of Birth + - Race + - Place of Birth + - Religion + - quasi identifiers + - Date of birth + - race + - Fine Grain security controls + - authorization + - RBAC + - Group membership + - Audit Logs + - audit trail + - record of events and changes of IT devices + - Data Auditing + - track activities related to data, such as data access, data changes, and data usage + - logs must be securely stored and protected from unauthorized access or tampering. + - enable reporting and analysis of data activities to identify potential security breaches, + - data quality issues, or + - compliance violations. + - enable alerting and notification of suspicious or unauthorized activities related to data. + - Naming Conventions + - Abbreviations + - Acronyms + - CamelCase + - snake_case + - kebab-case + - Data Classification + - Change Management policy for data + - process for managing changes to data, + - including how changes are proposed, reviewed, approved, and implemented. + - Documentation of changes made to data, + - including the reason for the change, + - the person responsible for the change, and + - the date of the change. 
+ - Approval process for changes to data, + - including who needs to approve the change and + - how the change will be implemented. + - Communication plan for changes to data, + - including how the change will be communicated to stakeholders and + - how feedback will be collected. + - Enterprise Data Dictionary + - Superset of all data dictionaries in an organization, + - providing a comprehensive view of the organization's data assets. + - Includes metadata about the data, such as its source, format, and usage. + - Name + - Description + - Data Type + - Owner + - Steward + - Source + - Usage + - Security Classification + - Information Security + - Identity and Access Management (IAM) + - Data Encryption + - Data Masking + - Data Loss Prevention (DLP) + - Compliance + - PCI DSS + - Payment Card Industry Data Security Standard (PCI DSS) + - set of security standards designed to ensure that all companies that + - accept, + - process, + - store or + - transmit credit card information maintain a secure environment. + - HIPAA (Health Insurance Portability and Accountability Act) + - GDPR (General Data Protection Regulation) + - CCPA (California Consumer Privacy Act) + - SOX (Sarbanes-Oxley Act) + - FISMA (Federal Information Security Management Act) + - NIST (National Institute of Standards and Technology) + - ISO 27001 (International Organization for Standardization 27001) + - SOC 2 (Service Organization Control 2) + - FedRAMP (Federal Risk and Authorization Management Program) + - CSA STAR (Cloud Security Alliance STAR) + - Information Security Management System (ISMS) + - systematic approach to managing sensitive company information so that it remains secure. + - includes people, processes and IT systems by applying a risk management process. 
+ - ISO 27001 + - InfoSec + - Information Life Cycle Management + - Roles in data governance include: + - Chief Data Officer (CDO) + - Data Governance Manager/Director + - Data Steward + - Stakeholder responsible for ensuring the quality and integrity of the data + - within a specific domain or area of the organization. + - from business + - SME (Subject Matter Expert) + - Individual with specialized knowledge and expertise in a particular area or domain. + - often consulted for their insights and guidance on specific topics or + - issues related to data governance. + - Data Owner + - Data Custodian + - Data Analyst + - Data Scientist + - Data Engineer + - Enterprise Information Architect + - Data Governance Committee + - Data Modeler + - Data Quality Analyst + - Data Privacy Officer + - Data Compliance Officer + - Data Security Officer + - Data Governance Council + - Data Governance Board + - Data Stewardship + - Data Ownership + - Data Policies + - Data Standards + - Data Compliance + - Data Security + - Data Privacy + - Data Quality + - naming conventions, + - data validation, + - Data Type Validation + - Range Validation + - Format Validation + - Uniqueness Validation + - Referential Integrity Validation + - Consistency Validation + - Completeness Validation + - Accuracy Validation + - Timeliness Validation + - Data Lineage + - Process of tracking and documenting the flow of data from its origin to its destination, + - including all transformations, processes, and + - systems that the data passes through along the way. + - Data Lineage provides visibility into the data's journey, + - allowing organizations to understand where the data came from, + - how it has been transformed, and + - where it is being used. 
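A few of the validation types listed above (type, range, format, completeness, timeliness) can be sketched as simple rule checks; the record, field names, and rules here are all illustrative:

```python
import re
from datetime import date

# Hypothetical record to validate
record = {"id": 42, "email": "user@example.com", "age": 31, "signup": "2023-05-01"}

checks = {
    "type":   isinstance(record["age"], int),        # data type validation
    "range":  0 <= record["age"] <= 120,             # range validation
    "format": re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", record["email"]) is not None,  # format validation
    "completeness": all(v not in (None, "") for v in record.values()),                 # completeness validation
    "timeliness": date.fromisoformat(record["signup"]) <= date.today(),                # timeliness validation
}
print(all(checks.values()))  # True: the record passes every rule
```

In practice such rules run as automated checkpoints at the source, preventing errors before data propagates downstream.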
+ - data cleansing, + - correcting error + - removing duplicates + - merge duplicate records + - standardize data formats + - update outdated information + - data profiling, + - profile data to understand its structure, + - content, and + - quality, + - identify data quality issues and + - make informed decisions about how to improve the data. + - Data profiling can be used to identify patterns and trends in the data, + - such as missing values, duplicate records, or inconsistent formatting, + - which can help organizations improve the quality of their data and make better use of it for analysis and decision-making. + - Process: + - Analyze Data + - Detect anomalies + - Use cases: + - Query optimization + - Data cleansing + - Data migration + - Data integration + - Scientific Data Management Analysis + - Project Management + - Data Discovery + - data enrichment, + - data scoring, + - score data based on predefined criteria to assess its quality, + - reliability, or relevance for specific use cases. + - Data scoring can be used to prioritize data for analysis, + - identify high-quality data for decision-making, and + - improve the overall quality of the data used in an organization. 
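A minimal profiling pass over sample rows, surfacing the missing values and duplicate records mentioned above (the data and field names are illustrative):

```python
from collections import Counter

# Illustrative dataset with one duplicate record and one missing value
rows = [
    {"city": "Paris", "temp": 21},
    {"city": "Paris", "temp": 21},     # duplicate record
    {"city": "Oslo",  "temp": None},   # missing value
    {"city": "Lima",  "temp": 19},
]

profile = {
    "row_count": len(rows),
    "missing_temp": sum(1 for r in rows if r["temp"] is None),
    "duplicate_rows": sum(c - 1 for c in Counter(tuple(sorted(r.items())) for r in rows).values() if c > 1),
    "distinct_cities": len({r["city"] for r in rows}),
}
print(profile)  # {'row_count': 4, 'missing_temp': 1, 'duplicate_rows': 1, 'distinct_cities': 3}
```

The profile then feeds the cleansing step: the duplicate count tells you how many records to merge, and the missing-value count flags fields needing enrichment or imputation.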
+ - Data statistics + - Descriptive Statistics + - Inferential Statistics + - Predictive Statistics + - Prescriptive Statistics + - Data Distribution + - Data Central Tendency + - Data Variability + - Data Correlation + - Data Regression Analysis + - Data Hypothesis Testing + - Data Sampling Techniques + - Data Visualization Techniques + - Data Integration + - Data Management + - Data Architecture + - Data Engineering + - Data Engineering Systems + - Vertica (columnar database) + - SAS environment (data analytics platform) + - RStudio Server (data analysis and visualization tool) + - Anaconda Enterprise Server (data science platform) + - Hadoop (big data processing framework) + - Data Munging + - Data Wrangling + - Data Science Tools + - Big Data Tools + - Big Data Frameworks + - Big Data Technologies + - Big Data Platforms + - Big Data Ecosystem + - Big Data Use Cases + - Big Data Challenges + - Big Data Solutions + - Big Data Trends + - Big Data Future + - Data Modelling Software + - ER Studio + - ERwin Data Modeler + - PowerDesigner + - Toad Data Modeler + - Oracle SQL Developer Data Modeler + - MySQL Workbench + - Microsoft Visio + - Lucidchart + - Draw.io + - Open Source Data Modelling Tools + - ER One + - Open ModelSphere + - DBDesigner + - SQL Power Architect + - Vertabelo + - ModelRight + - ERBuilder + - Dataedo + - ERDPlus + - QuickDBD + - GenMyModel + - ERD Lab + - Dia + - DBeaver + - DbSchema + - SchemaSpy + - SchemaCrawler + - Data Modelling Tools + - Data Modelling Techniques + - Kimball + - Inmon + - Data Vault + - Data Modelling Best Practices + - Data Modelling Examples + - Data Modelling Types + - Conceptual Data Modelling + - Logical Data Modelling + - Physical Data Modelling + - Data Modelling Notations + - Entity-Relationship (ER) Notation + - Unified Modeling Language (UML) Notation + - Object-Role Modeling (ORM) Notation + - IDEF1X Notation + - Barker Notation + - Crow's Foot Notation + - Chen Notation + - Bachman Notation + - Martin Notation + - Information Engineering (IE) Notation + - Data Vault Notation + - Anchor Modelling Notation + - Dimensional Modelling Notation + - Star Schema Notation + - Snowflake Schema Notation + - Data Vault 2.0 Notation + - Graph Data Modelling Notation + - NoSQL Data Modelling Notation + - XML Data Modelling Notation + - JSON Data Modelling Notation + - RDF Data Modelling Notation + - OWL Data Modelling Notation + - Tool-specific notations (ERwin, PowerDesigner, ER/Studio, Toad Data Modeler, Oracle SQL Developer Data Modeler, MySQL Workbench, Microsoft Visio, Lucidchart, Draw.io, Gliffy, Creately, SmartDraw, Visual Paradigm, Sparx Systems Enterprise Architect, IBM Rational Software Architect, Sparx Systems Pro Cloud) + - Enterprise Data Architecture + - Data Architecture Frameworks + - The Open Group Architecture Framework (TOGAF) + - Business Architecture + - Model business requirements and processes to align with organizational goals and objectives. + - Information Systems Architecture + - Data lifecycle management, + - data governance, and + - data quality to ensure that data is accurate, consistent, and secure. 
+ - Application Architecture + - Application logic and SDLC + - Data Architecture + - Technology Architecture + - Infrastructure Modeling + - Network Architecture + - Security Architecture + - Cloud Architecture + - Architecture Development Method (ADM) + - Architecture Content Framework + - Architecture Repository + - Architecture Capability Framework + - Zachman Framework + - Federal Enterprise Architecture Framework (FEAF) + - Gartner Enterprise Architecture Framework + - Department of Defense Architecture Framework (DoDAF) + - Ministry of Defence Architecture Framework (MODAF) + - Blockchain + - distributed ledger technology that allows for secure and transparent transactions + - without the need for a central authority (broker or intermediary) + - such as banks, governments, or Amazon Marketplace + - used in various industries, including finance, supply chain management, healthcare, and more. + - Key features of blockchain include: + - Decentralization: Blockchain operates on a decentralized network of computers, + - eliminating the need for a central authority to manage transactions. + - Transparency: All transactions on the blockchain are visible to all participants, + - ensuring transparency and accountability. + - Security: Blockchain uses cryptographic techniques to secure transactions and + - prevent unauthorized access or tampering. + - Immutability: Once a transaction is recorded on the blockchain, + - it cannot be altered or deleted, ensuring the integrity of the data. + - Consensus Mechanism: Blockchain relies on consensus mechanisms, + - such as Proof of Work (PoW) or + - Proof of Stake (PoS), + - to validate and confirm transactions on the network. + - Common use cases for blockchain include: + - Cryptocurrencies: Bitcoin, Ethereum, and other cryptocurrencies + - are built on blockchain technology, + - allowing for secure and decentralized digital currency transactions. 
+ - Supply Chain Management: Blockchain can be used to track and verify the movement of goods + - through the supply chain, ensuring transparency and reducing fraud. + - transaction can be: + - peer-to-peer transactions, + - smart contracts, + - decentralized applications (DApps), + - identity management, + - voting systems, + - healthcare records management, + - real estate transactions, + - intellectual property protection, + - financial services, + - and more. + - Smart contracts + - self-executing contracts with the terms of the agreement directly written into code. + - run on blockchain platforms, such as Ethereum, and + - automatically execute when predefined conditions are met. + - Key features of smart contracts include: + - Automation: Smart contracts automatically execute when predefined conditions are met, + - eliminating the need for intermediaries and reducing the risk of human error. + - Transparency: The terms of the contract are visible to all parties involved, + - ensuring transparency and accountability. + - Security: Smart contracts use cryptographic techniques to secure transactions and + - prevent unauthorized access or tampering. + - Immutability: Once a smart contract is deployed on the blockchain, + - it cannot be altered or deleted, ensuring the integrity of the agreement. + - Common use cases for smart contracts include: + - Financial Services: Smart contracts can be used for various financial applications, + - such as decentralized finance (DeFi) platforms, + - insurance claims processing, and + - cross-border payments. + - Supply Chain Management: Smart contracts can automate and verify transactions in the supply chain, + - ensuring transparency and reducing fraud. + - Real Estate: Smart contracts can facilitate property transactions, + - automating processes such as title transfers and rental agreements. 
+ - Healthcare: Smart contracts can be used to manage healthcare records, + - automate insurance claims, and facilitate secure data sharing between healthcare providers. + - Identity Management: Smart contracts can be used to manage digital identities, + - allowing individuals to control their personal information and securely share it with trusted parties. + - Voting Systems: Smart contracts can be used to create secure and transparent voting systems, + - ensuring the integrity of the voting process and reducing the risk of fraud. + - Intellectual Property Protection: Smart contracts can be used to manage intellectual property rights, + - automate royalty payments, and facilitate secure licensing agreements. + - DApps (Decentralized Applications) + - applications that run on a decentralized network, + - such as a blockchain, + - rather than on a centralized server. + - DApps are designed to be open-source, transparent, and resistant to censorship. + - Key features of DApps include: + - Decentralization: DApps operate on a decentralized network, + - eliminating the need for a central authority to manage the application. + - Transparency: The code and data of DApps are visible to all participants, + - ensuring transparency and accountability. + - Security: DApps use cryptographic techniques to secure transactions and + - prevent unauthorized access or tampering. + - Immutability: Once a DApp is deployed on the blockchain, + - it cannot be altered or deleted, ensuring the integrity of the application. + - Common use cases for DApps include: + - Decentralized Finance (DeFi): DApps can be used for various financial applications, + - such as lending platforms, decentralized exchanges, and yield farming protocols. + - Gaming: DApps can be used to create decentralized gaming platforms, + - allowing players to own and trade in-game assets securely. 
+ - Social Media: DApps can be used to create decentralized social media platforms, + - giving users control over their data and content. + - Supply Chain Management: DApps can automate and verify transactions in the supply chain, + - ensuring transparency and reducing fraud. + - Real Estate: DApps can facilitate property transactions, + - automating processes such as title transfers and rental agreements. + - Healthcare: DApps can be used to manage healthcare records, + - automate insurance claims, and facilitate secure data sharing between healthcare providers. + - Identity Management: DApps can be used to manage digital identities, + - allowing individuals to control their personal information and securely share it with trusted parties. + - Voting Systems: DApps can be used to create secure and transparent voting systems, + - ensuring the integrity of the voting process and reducing the risk of fraud. + - Intellectual Property Protection: DApps can be used to manage intellectual property rights, + - automate royalty payments, and facilitate secure licensing agreements. + - Consensus algorithms + - mechanisms used in blockchain networks to achieve agreement among distributed nodes on the state of the blockchain. 
+ - Common consensus algorithms include: + - Proof of Work (PoW) + - Proof of Stake (PoS) + - Delegated Proof of Stake (DPoS) + - Practical Byzantine Fault Tolerance (PBFT) + - Raft + - Tendermint + - Federated Byzantine Agreement (FBA) + - Proof of Authority (PoA) + - Proof of Elapsed Time (PoET) + - Proof of Space (PoSpace) + - Proof of Capacity (PoC) + - Proof of Burn (PoB) + - Proof of Activity (PoA) + - Proof of Importance (PoI) + - Proof of History (PoH) + - Proof of Weight (PoWeight) + - Proof of Space-Time (PoST) + - Types of Data + - Structured Data + - Unstructured Data + - Semi-Structured Data + - Time-Series Data + - Geospatial Data + - Graph Data + - Text Data + - Image Data + - Video Data + - Audio Data + - Sensor Data + - Log Data + - Social Media Data + - Machine Data + - Internet of Things (IoT) Data + - Big Data + - Types of Data in Data Modelling + - Entity + - Attribute + - Relationship + - Primary Key + - Foreign Key + - Composite Key + - Surrogate Key + - Candidate Key + - Alternate Key + - Unique Key + - Index + - View + - Stored Procedure + - Function + - Trigger + - Schema + - Table + - Column + - Row + - Data Type + - Null Value + - Default Value + - Constraint + - Check Constraint + - Unique Constraint + - Primary Key Constraint + - Foreign Key Constraint + - Not Null Constraint + - Default Constraint + - Index Constraint + - View Constraint + - Stored Procedure Constraint + - Function Constraint + - Trigger Constraint + - Schema Constraint + - Types of Data in Data Architecture + - Master Data + - single source of truth for critical business entities and attributes. + - core data essential to the operations of a business. + - information about customers, + - products, + - suppliers, and + - other key entities that are critical to the functioning of the organization. + - used across multiple systems and applications. 
+ - does not change frequently and is typically maintained in a centralized repository, + - such as a master data management (MDM) system. + - examples include eye color, + - blood type, + - serial number of a product, + - building street address, + - e.g. CRM, ERP, MDM + - Reference Data + - Transactional Data + - generated from day-to-day operations of business and + - captures specific events or transactions. + - Examples include + - sales transactions, + - customer interactions, and + - financial transactions. + - time-stamped and + - used for operational reporting and analysis. + - lookup data + - used to populate dropdowns, + - validate input, and + - provide reference information for other data elements. + - Analytical Data + - Metadata + - data that provides information about other data. + - describes the characteristics, structure, and context of data, + - making it easier to understand, manage, and use the data effectively. + - Examples of metadata include: + - Data Type: Describes the type of data + - (e.g., string, integer, date). + - Data Source: Indicates where the data originated from + - (e.g., database, file, API). + - Data Owner: + - Identifies the person or team responsible for the data. + - Data Steward: + - Identifies the person or team responsible for managing the data. + - Data Quality: + - Provides information about the quality of the data + - (e.g., accuracy, completeness, consistency). + - Data Usage: + - Describes how the data is used within the organization + - (e.g., reporting, analytics, operational processes). + - Data Security: + - Indicates the security classification of the data + - (e.g., confidential, public, internal). + - Data Retention: + - Specifies how long the data should be retained and + - when it should be archived or deleted. 
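The metadata fields above can be collected into a simple asset record; in this sketch every value, the asset name, and the `required`-fields rule are illustrative:

```python
# Illustrative metadata entry for a single data asset, using the fields listed above
metadata = {
    "name": "customer_orders",
    "description": "Daily customer order transactions",
    "data_type": "table",
    "source": "orders_db (JDBC)",
    "owner": "Sales Operations",
    "steward": "Jane Doe",
    "usage": ["reporting", "analytics"],
    "security_classification": "confidential",
    "retention": "7 years, then archive",
}

# A governance checkpoint might require certain fields to be present before the
# asset is admitted to the data catalog:
required = {"name", "owner", "security_classification", "retention"}
missing = required - metadata.keys()
print("compliant" if not missing else f"missing: {sorted(missing)}")  # compliant
```

A data catalog is essentially many such records, made searchable and governed with access controls.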
+ - metadata management best practices + - Define metadata strategy + - based on long term business strategy + - Link metadata management with digital transformation + - Establish Scope and Ownership + - Identify use cases for metadata + - Define roles of metadata creators, consumers and managers + - Prioritise data assets + - Identify key functionalities + - Choose tool with intelligent optimization, data discovery and use case analysis + - Adopt metadata standards + - uniform usage and interpretations with vendors and customers + - interoperable data + - Benefits of metadata management + - Artifacts tracking + - Tracking the lineage + - Model Parameters Analysis + - highlight data quality and data privacy issues + - use data relationships to identify new features for enriching data science models + - MLMD (ML Metadata) database + - library for recording and retrieving metadata + - part of TensorFlow Extended (TFX) + - registers the following metadata in the Metadata Store database + - artifacts generated through components or steps in an ML pipeline + - executions of these components + - metadata about pipelines and associated lineage information + - Shows DAG (Directed Acyclic Graph) of all executions + - supports SQLite (in-memory or on-disk) and MySQL + - Big Data + - Real-Time Data + - Historical Data + - Unstructured Data + - Semi-Structured Data + - Structured Data + - Time-Series Data + - Geospatial Data + - Graph Data + - Text Data + - Image Data + - Video Data + - Audio Data + - Sensor Data + - Log Data + - Social Media Data + - Machine Data + - Internet of Things (IoT) Data + - SQL + - DDL (Data Definition Language) + - DML (Data Manipulation Language) + - DCL (Data Control Language) + - TCL (Transaction Control Language) + - e.g. + - CREATE TABLE + - INSERT INTO + - SELECT + - UPDATE + - DELETE + - GRANT + - REVOKE + - COMMIT + - ROLLBACK + - SAVEPOINT + - e.g. 
SQL Queries + - SELECT * FROM Customers; + - SELECT Name, Age FROM Employees WHERE Department = 'Sales'; + - INSERT INTO Orders (CustomerID, OrderDate, TotalAmount) VALUES (1, '2023-01-01', 100.00); + - UPDATE Products SET Price = 19.99 WHERE ProductID = 1; + - DELETE FROM Customers WHERE CustomerID = 1; + - CREATE TABLE Employees (EmployeeID INT PRIMARY KEY, Name VARCHAR(50), Age INT, Department VARCHAR(50)); + - ALTER TABLE Customers ADD COLUMN Email VARCHAR(100); + - DROP TABLE Orders; + - GRANT SELECT ON Employees TO User1; + - REVOKE SELECT ON Employees FROM User1; + - COMMIT; + - ROLLBACK; + - SAVEPOINT Savepoint1; + - NoSQL + - Document Stores + - Key-Value Stores + - Column-Family Stores + - Graph Databases + - Time-Series Databases + - Object-Oriented Databases + - Multi-Model Databases + - In-Memory Databases + - Distributed Databases + - Cloud-Based Databases + - NewSQL Databases + - Data Warehousing + - process of collecting, storing, and managing large volumes of data from various sources to support business intelligence and analytics activities. + - Data warehousing involves the extraction, transformation, and loading (ETL) of data from operational systems into a central repository, + - where it can be organized, optimized, and made available for analysis and reporting. + - Key components of a data warehouse include: + - Data Sources: Operational systems, external data sources, and other databases that provide the raw data for the data warehouse. + - ETL Process: The process of extracting data from source systems, transforming it into a consistent format, and loading it into the data warehouse. + - Data Storage: The central repository where the data is stored, typically organized in a star schema or snowflake schema for efficient querying and analysis. + - Metadata: Information about the data, such as its source, structure, and usage, which helps users understand and navigate the data warehouse. 
+ - Data Access Tools: Tools and interfaces that allow users to query, analyze, and visualize the data in the data warehouse, such as SQL clients, business intelligence tools, and data visualization platforms. + - Common use cases for data warehousing include: + - Business Intelligence: Data warehouses provide a centralized repository of data that can be used for reporting, analysis, and decision-making activities. + - Data Analytics: Data warehouses enable organizations to perform complex analytics and gain insights from their data, such as customer behavior analysis, sales forecasting, and market research. + - Data Integration: Data warehouses can integrate data from multiple sources, providing a unified view of the organization's data assets and enabling cross-functional analysis. + - Data Governance: + - Data warehouses can support data governance initiatives by providing a centralized repository + - for data management, quality control, and compliance activities. + - Data Archiving: + - Data warehouses can serve as a long-term storage solution for historical data, + - allowing organizations to retain and access data for regulatory compliance, auditing, + - and historical analysis purposes. 
+ - Data Model Terminology: Logical Model vs Physical Model + - Entity: Table + - Attribute: Column + - Row: Record + - Relationship: + - Primary Key + - Foreign Key + - Composite Key + - Surrogate Key + - Candidate Key + - Alternate Key + - Unique Key + - Index + - View + - Stored Procedure + - Function + - Trigger + - Schema + - Database Administration + - DBA: Database Administrator + - responsibilities of DBA include: + - Database Installation and Configuration + - Database Security Management + - Database Backup and Recovery + - Database Performance Tuning + - Database Monitoring and Maintenance + - Database User Management + - Database Capacity Planning + - Database Upgrades and Patching + - Database Troubleshooting and Issue Resolution + - Database Documentation and Reporting + - Database Compliance and Auditing + - log database activities, + - login attempts, + - failed login attempts, + - data changes, and + - access to sensitive data, + - monitor database access and usage, and + - ensure compliance with regulatory requirements and organizational policies. + - employ principles of least privilege and + - role-based access control (RBAC) to restrict access to sensitive data and database resources, + - ensuring that users only have the necessary permissions to perform their job functions. 
+ - Database Disaster Recovery Planning + - Database High Availability and Replication + - Database Automation and Scripting + - Database Change Management + - Database Vendor Management + - Database Training and Support + - Database Architecture and Design + - Database Development and Testing + - Database Data Modeling and Schema Design + - reducing joins, + - combining tables, + - denormalization, + - Database Query Optimization + - Database Indexing and Partitioning + - Ways to write SQL code: + - SQL Clients + - Integrated Development Environments (IDEs) + - Command-Line Interfaces (CLIs) + - Database Management Tools + - Data Visualization Tools + - Business Intelligence Tools + - Cloud-Based Database Services + - DAO: Data Access Object + - design pattern that provides an abstract interface to a database or other persistence mechanism. + - encapsulates the logic for accessing and manipulating data, + - allowing for separation of concerns and improved maintainability of the application code. + - typically includes methods for performing CRUD (Create, Read, Update, Delete) operations on the underlying data store, + - as well as methods for managing database connections and transactions. + - Benefits of using a DAO include: + - Abstraction: DAOs provide a layer of abstraction between the application code and the underlying data store, allowing for flexibility and ease of maintenance. + - Reusability: DAOs can be reused across different parts of the application, reducing code duplication and improving consistency. + - Testability: DAOs can be easily tested in isolation, allowing for better unit testing and improved code quality. + - SQL can be stored in form of: + - Embedded SQL in application code + - dedicated class as DAO (Data Access Object) or Repository + - needs to be compiled and deployed with the application code + - e.g. CustomerDAO.java, OrderDAO.java, ProductDAO.java, etc. 
+ - needs downtime for changes and updates + - Prone to SQL injection attacks + - Dynamic SQL + - Network latency, as the SQL code is executed on the application server and requires communication with the database server, which can lead to slower performance compared to stored procedures. + - Stored Procedures + - Dedicated by function or business logic + - Executable on the database server(RDBMS) + - e.g. CustomerDAO, OrderDAO, ProductDAO, etc. + - can be called from application code or other stored procedures + - no need to compile and deploy with the application code, + - as they are stored and executed on the database server + - no downtime + - Less prone to SQL injection attacks, + - as the SQL code is stored and executed on the database server, + - rather than being embedded in the application code. + - Not dynamic, as they are precompiled and stored on the database server, + - which can limit their flexibility and adaptability to changing business requirements. + - preferred method: access database objects through stored procedures, + - as it provides better security, maintainability, and performance compared to embedding SQL in application code. + - By using stored procedures, you can centralize your database logic, + - making it easier to manage and update as needed, + - without requiring changes to the application code. + - offers layer of abstraction between the application code and the database, + - allowing for better separation of concerns and improved maintainability of the application code. + - By encapsulating the database logic within stored procedures, you can reduce the risk of SQL injection attacks and improve the overall security of your application. + - Additionally, stored procedures can be optimized for performance, + - as they are precompiled and stored on the database server, + - which can lead to faster execution times compared to embedding SQL in application code. 
+ - processed on the database server, + - which can reduce network latency and improve performance compared to embedding SQL in application code, + - as the SQL code is executed directly on the database server, + - rather than requiring communication between the application server and the database server. + - This can lead to faster execution times and improved overall performance of your application. + - Business Intelligence + - process of analyzing data and presenting actionable information to help executives, managers, and other corporate end users make informed business decisions. + - Business intelligence encompasses a variety of tools, applications, and methodologies that enable organizations to collect data from internal systems and external sources, prepare it for analysis, run queries against the data, and create reports, dashboards, and data visualizations to make the analytical results available to decision-makers. + - Key components of business intelligence include: + - Data Warehousing: The process of collecting, storing, and managing large volumes of data from various sources to support business intelligence activities. + - Data Integration: The process of combining data from different sources into a unified view for analysis and reporting. + - Data Quality Management: Ensuring the accuracy, completeness, consistency, and reliability of the data used for business intelligence. + - Data Analysis: The process of examining data to uncover patterns, trends, and insights that can inform business decisions. + - Reporting and Visualization: The creation of reports, dashboards, and visualizations to communicate the results of data analysis to stakeholders. + - Common use cases for business intelligence include: + - Sales Analysis: Analyzing sales data to identify trends, measure performance against targets, and make informed decisions about sales strategies. 
+ - Customer Analysis: Analyzing customer data to understand behavior, preferences, and demographics for targeted marketing and improved customer service. + - Financial Analysis: Analyzing financial data to monitor performance, manage budgets, and make informed financial decisions. + - Operational Analysis: Analyzing operational data to optimize processes, improve efficiency, and reduce costs. + - Market Analysis: Analyzing market data to identify opportunities, assess competition, and inform strategic planning. + - Supply Chain Analysis: Analyzing supply chain data to optimize inventory management, improve supplier performance, and reduce lead times. + - Human Resources Analysis: Analyzing HR data to manage workforce performance, identify talent gaps, and inform recruitment and retention strategies. + - Risk Analysis: Analyzing risk data to identify potential threats, assess vulnerabilities, and inform risk management strategies. + - Product Analysis: Analyzing product data to understand performance, identify opportunities for improvement, and inform product development decisions. + - Social Media Analysis: Analyzing social media data to understand customer sentiment, monitor brand reputation, and inform marketing strategies. 
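As a minimal sketch of the sales-analysis use case above, here is an illustrative plain-Python example (the records, field names, and figures are hypothetical; real BI work would run such aggregations against a data warehouse with dedicated BI tools):

```python
# Hypothetical sales records; in practice these would come from a warehouse query.
from collections import defaultdict

sales = [
    {"month": "2024-01", "product": "widget", "amount": 120.0},
    {"month": "2024-01", "product": "gadget", "amount": 80.0},
    {"month": "2024-02", "product": "widget", "amount": 150.0},
    {"month": "2024-02", "product": "gadget", "amount": 90.0},
]

def monthly_totals(rows):
    """Total sales per month, the basis for a simple trend view."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["month"]] += row["amount"]
    return dict(totals)

def top_product(rows):
    """Best-selling product by total amount."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["product"]] += row["amount"]
    return max(totals, key=totals.get)

print(monthly_totals(sales))  # {'2024-01': 200.0, '2024-02': 240.0}
print(top_product(sales))     # widget
```

The same group-and-aggregate shape underlies most of the use cases listed here; only the grouping key (customer, shift, region) changes.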
+ - BI developer role + - BI tools include: + - Microsoft Power BI + - Business Objects (BO) Crystal Reports + - Crystal Reports: XML + - Microsoft SQL Server Reporting Services (SSRS) + - MS-SSRS: C# + - Microsoft SQL Server Analysis Services (SSAS) + - Microsoft SQL Server Integration Services (SSIS) + - Tableau + - Tableau SDK: Python, R, Java, C++, C + - QlikView + - Looker + - Domo + - Sisense + - MicroStrategy + - IBM Cognos Analytics + - SAP BusinessObjects + - Oracle Business Intelligence + - TIBCO Spotfire + - Yellowfin BI + - Zoho Analytics + - Google Data Studio + - Amazon QuickSight + - Salesforce Einstein Analytics + - Alteryx + - DataRobot + - ThoughtSpot + - Mode Analytics + - Chartio + - Periscope Data + - Redash + - Metabase + - Apache Superset + - Grafana + - Kibana + - D3.js + - Highcharts + - Plotly + - Matplotlib + - Seaborn + - ggplot2 + - Bokeh + - Leaflet + - Mapbox + - Carto + - Kepler.gl + - ArcGIS + - QGIS + - GeoPandas + - PostGIS + - GeoServer + - MapServer + - OpenLayers + - CesiumJS + - MapLibre GL JS + - Mapbox GL JS + - Google Maps API + - HERE Maps API + - Bing Maps API + - MapQuest API + - TomTom Maps API + - Mapillary API + - OpenStreetMap API + - Enterprise Content Management Systems (ECMS), open source and commercial + - Joomla + - Plone + - Magento + - Wiki + - Alfresco + - Nuxeo + - Liferay + - OpenText + - SharePoint + - Documentum + - M-Files + - Box + - Dropbox Business + - Google Drive for Work + - OneDrive for Business + - EDA (Exploratory Data Analysis) + - process of analyzing and summarizing datasets to gain insights, identify patterns, and understand the underlying structure of the data. + - EDA is an essential step in the data analysis process, as it helps analysts and data scientists to explore the data, identify potential issues, and formulate hypotheses for further analysis.
+ - Key components of EDA include: + - Data Cleaning: The process of identifying and correcting errors, inconsistencies, and missing values in the dataset to ensure the quality of the data. + - Data Visualization: The creation of visual representations of the data, such as charts, graphs, and plots, to help identify patterns, trends, and relationships within the data. + - Statistical Analysis: The application of statistical techniques to summarize and describe the characteristics of the dataset, such as measures of central tendency, variability, and correlation. + - Feature Engineering: The process of creating new features or variables from existing data to improve the performance of machine learning models. + - Common use cases for EDA include: + - Understanding Data Distribution: Analyzing the distribution of variables in the dataset to identify skewness, outliers, and other characteristics that may impact analysis. + - Identifying Relationships: Exploring relationships between variables to identify correlations, dependencies, and potential causal relationships. + - Detecting Outliers: Identifying outliers in the dataset that may indicate errors or unusual observations that require further investigation. + - Formulating Hypotheses: Using insights gained from EDA to formulate hypotheses for further analysis or modeling. + - Informing Data Preprocessing: Using insights from EDA to inform decisions about data preprocessing steps, such as handling missing values or encoding categorical variables. 
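The EDA components above can be sketched with just the Python standard library: summary statistics plus a simple z-score outlier check (the sample data is made up; in practice pandas and plotting libraries would be used):

```python
import statistics

# Hypothetical sample: mostly stable readings with one suspicious spike.
values = [10, 12, 11, 13, 12, 11, 95, 10, 12, 11]

def zscore_outliers(data, threshold=2.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    m = statistics.mean(data)
    s = statistics.stdev(data)
    return [x for x in data if abs(x - m) / s > threshold]

print("mean:", statistics.mean(values))      # pulled up sharply by the spike
print("median:", statistics.median(values))  # robust to the spike
print("outliers:", zscore_outliers(values))
```

The gap between mean and median is itself an EDA signal of skew, and the flagged point is exactly the kind of observation the "Detecting Outliers" step sends for further investigation.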
+ - Statistical data analytics + - Data Science + - Predictive Modeling + - Paid packages and tools for predictive modeling include: + - IBM SPSS Modeler + - SAS Enterprise Miner + - RapidMiner + - KNIME + - DataRobot + - H2O.ai + - Alteryx + - Microsoft Azure Machine Learning Studio + - Google Cloud AutoML + - Amazon SageMaker + - Open-source packages and tools for predictive modeling include: + - scikit-learn + - TensorFlow + - Keras + - PyTorch + - XGBoost + - LightGBM + - CatBoost + - Caret + - Machine Learning algorithms + - Supervised Learning + - Linear Regression + - Logistic Regression + - Decision Trees + - Random Forests + - Support Vector Machines (SVM) + - Naive Bayes + - K-Nearest Neighbors (KNN) + - Gradient Boosting Machines (GBM) + - XGBoost + - LightGBM + - CatBoost + - Unsupervised Learning + - K-Means Clustering + - Hierarchical Clustering + - DBSCAN + - Principal Component Analysis (PCA) + - t-Distributed Stochastic Neighbor Embedding (t-SNE) + - Autoencoders + - Reinforcement Learning + - Q-Learning + - Deep Q-Networks (DQN) + - Policy Gradient Methods + - Actor-Critic Methods + - Proximal Policy Optimization (PPO) + - Trust Region Policy Optimization (TRPO) + - Deep Learning + - Convolutional Neural Networks (CNNs) + - Recurrent Neural Networks (RNNs) + - Long Short-Term Memory (LSTM) Networks + - Generative Adversarial Networks (GANs) + - Transformer Models + - BERT (Bidirectional Encoder Representations from Transformers) + - GPT (Generative Pre-trained Transformer) + - Ensemble Learning + - Bagging (Bootstrap Aggregating) + - Boosting + - Stacking + - Voting + - Blending + - Knowledge discovery and data mining (KDD) + - Data Visualization (Data Viz) + - Data Storytelling + - Infographics + - tools for creating infographics include: + - Canva + - Piktochart + - Venngage + - Infogram + - Easel.ly + - Visme + - Adobe Spark + - Snappa + - Biteable + - Animaker + - Powtoon + - Crello + - Vyond + - Moovly + - Renderforest + - Animatron + - Wideo + - Toonly +
- Doodly + - VideoScribe + - Explaindio + - Camtasia + - ScreenFlow + - OBS Studio + - Filmora + - Adobe Premiere Pro + - Final Cut Pro + - iMovie + - DaVinci Resolve + - HitFilm Express + - Lightworks + - Shotcut + - OpenShot + - Blender + - VSDC Free Video Editor + - Kdenlive + - Avidemux + - VideoPad Video Editor + - Olive Video Editor + - Libraries and tools for data visualization include: + - Matplotlib + - Seaborn + - ggplot2 + - Bokeh + - Plotly + - D3.js + - Highcharts + - Leaflet + - Mapbox + - Carto + - Kepler.gl + - ArcGIS + - QGIS + - GeoPandas + - PostGIS + - GeoServer + - MapServer + - OpenLayers + - CesiumJS + - MapLibre GL JS + - Mapbox GL JS + - Google Maps API + - HERE Maps API + - Bing Maps API + - MapQuest API + - TomTom Maps API + - Mapillary API + - OpenStreetMap API + - Big Data + - 5 V's of Big Data + - Volume + - large volumes of data are generated from various sources, + - such as social media, sensors, and transactional systems, + - requiring scalable storage and processing solutions to manage and + - analyze the data effectively. + - Velocity + - large volumes of data are generated at high speeds, + - requiring real-time processing and analysis to derive insights and make informed decisions. + - Variety + - Structured Data + - Unstructured Data + - Semi-Structured Data + - Veracity + - veracity refers to the quality and reliability of the data, + - ensuring that the data is accurate, consistent, and trustworthy for analysis and decision-making. + - Variability + - Data from different sources can have varying formats, structures, and quality, + - requiring data integration and cleansing processes to ensure consistency and reliability for analysis. +- RACI table + - Responsible: The person or team responsible for performing the task or activity. + - Accountable: The person who is ultimately accountable for the successful completion of the task or activity. 
+ - Consulted: The person or team who provides input, advice, or expertise during the execution of the task or activity. + - Informed: The person or team who needs to be kept informed about the progress and outcomes of the task or activity. +- Data Product Requirements + - Integrated model between sales and employee shifts + - Ability to identify most popular products and their sales trends + - Ability to analyze sales performance by employee shifts + - Ability to identify peak sales hours and optimize employee scheduling accordingly + - Ability to forecast future sales based on historical data and employee shift patterns + - Ability to generate reports and visualizations to support decision-making and strategic planning + - Ability to integrate with existing sales and employee scheduling systems for seamless data flow and analysis + - Ability to ensure data quality and accuracy for reliable analysis and insights + - Ability to provide actionable insights and recommendations for improving sales performance and optimizing employee scheduling based on the integrated model. + - Ability to manipulate information through user-friendly dashboards +- Data Zone Management + - Data Zone + - Resource boundaries for controlling access to data and ensuring data security and compliance. + - Data Zone boundary types: + - Physical Boundaries: Physical barriers that restrict access to data, such as locked rooms, secure facilities, or restricted areas. + - Logical Boundaries: Virtual boundaries that control access to data based on user roles, permissions, or authentication mechanisms. + - Network Boundaries: Boundaries that control access to data based on network segmentation, firewalls, or virtual private networks (VPNs). + - Application Boundaries: Boundaries that control access to data based on application-level security measures, such as user authentication and authorization.
+ - Data Classification Boundaries: Boundaries that control access to data based on its classification level, such as public, internal, confidential, or restricted. + - Regulatory Boundaries: Boundaries that control access to data based on regulatory requirements and compliance standards, such as GDPR, HIPAA, or PCI DSS. + - Organizational Boundaries: Boundaries that control access to data based on organizational policies, procedures, and governance frameworks. + - Temporal Boundaries: Boundaries that control access to data based on time-based restrictions, such as access during specific hours or for a limited duration. + - Geographical Boundaries: Boundaries that control access to data based on geographical location, such as access restrictions for data stored in specific regions or countries. + - Data Sensitivity Boundaries: Boundaries that control access to data based on its sensitivity level, such as personally identifiable information (PII), financial data, or intellectual property. + - Data Retention Boundaries: Boundaries that control access to data based on its retention period, such as access restrictions for data that is archived or scheduled for deletion. + - Data Zone types: + - Public Data Zone: A data zone that contains data that is publicly available and can be accessed by anyone without restrictions. + - Internal Data Zone: A data zone that contains data that is intended for internal use within an organization and may have access restrictions based on user roles or permissions. + - Confidential Data Zone: A data zone that contains sensitive data that requires strict access controls and may only be accessible to authorized personnel with specific permissions. + - Restricted Data Zone: A data zone that contains highly sensitive data that requires the highest level of security and access controls, and may only be accessible to a limited number of authorized personnel with specific permissions. 
+ - Regulated Data Zone: A data zone that contains data that is subject to regulatory requirements and compliance standards, and may require specific access controls and security measures to ensure compliance. + - Archive Data Zone: A data zone that contains data that is no longer actively used but needs to be retained for historical or compliance purposes, and may have access restrictions based on its retention period. + - Test Data Zone: A data zone that contains data that is used for testing and development purposes, and may have access restrictions based on its sensitivity and intended use. + - Backup Data Zone: A data zone that contains backup copies of data for disaster recovery purposes, and may have access restrictions based on its criticality and intended use. + - Data Lake Zone: A data zone that contains raw, unprocessed data from various sources, and may have access restrictions based on its sensitivity and intended use for data analysis and exploration. + - Data Warehouse Zone: A data zone that contains structured, processed data that is optimized for reporting and analysis, and may have access restrictions based on its sensitivity and intended use for business intelligence and decision-making. + - Data Mart Zone: A data zone that contains a subset of data from a data warehouse that is tailored for specific business needs or departments, and may have access restrictions based on its sensitivity and intended use for specific business functions or teams. + - Data Sharing Zone: A data zone that contains data that is shared with external partners, vendors, or customers, and may have access restrictions based on data sharing agreements and security requirements for external data sharing. + - Data Science Zone: A data zone that contains data that is used for data science and machine learning purposes, and may have access restrictions based on its sensitivity and intended use for data analysis and modeling. 
+ - Data Governance Zone: A data zone that contains data that is used for data governance and compliance purposes, and may have access restrictions based on its sensitivity and intended use for data governance activities, such as data quality management, data lineage tracking, and regulatory compliance monitoring. + - Management Data Zone: A data zone that contains data that is used for managing and monitoring the data zones, such as access logs, audit trails, and metadata about the data zones themselves, and may have access restrictions based on its sensitivity and intended use for data zone management activities. + - Sales Data Zone: A data zone that contains sales-related data, such as customer information, sales transactions, and product details, and may have access restrictions based on its sensitivity and intended use for sales analysis and reporting. + - HR Data Zone: A data zone that contains human resources-related data, such as employee information, payroll data, and performance metrics, and may have access restrictions based on its sensitivity and intended use for HR analysis and reporting. + - Lakehouse + - A data architecture that combines the benefits of data lakes and data warehouses, providing a unified platform for storing, processing, and analyzing large volumes of structured and unstructured data. + - Key features of a lakehouse include: + - Unified Data Storage: A lakehouse provides a single storage layer that can handle both structured and unstructured data, allowing for greater flexibility in data management and analysis.
+ - Schema Enforcement: A lakehouse enforces a schema on the data, ensuring that it is organized and structured for efficient querying and analysis, while still allowing for the flexibility of a data lake. + - ACID Transactions: A lakehouse supports ACID (Atomicity, Consistency, Isolation, Durability) transactions, ensuring data integrity and reliability during data processing and analysis. + - Data Governance: A lakehouse provides robust data governance capabilities, including access controls, data lineage tracking, and metadata management, to ensure data security and compliance. + - Scalability: A lakehouse is designed to scale horizontally, allowing for the storage and processing of large volumes of data as the organization's needs grow. + - Performance: A lakehouse is optimized for performance, providing fast query execution and low latency for data analysis and reporting + - Lakehouse principles include: + - Data as a Service: A lakehouse treats data as a service, providing a unified platform for storing, processing, and analyzing data across the organization. + - Schema on Read: A lakehouse allows for schema on read, meaning that the schema is applied to the data at the time of query execution, providing flexibility in data management and analysis. + - Open Data Formats: A lakehouse uses open data formats, such as Parquet or ORC, to ensure compatibility with a wide range of tools and technologies for data processing and analysis. + - Metadata Management: A lakehouse provides robust metadata management capabilities, allowing for the organization and discovery of data assets within the lakehouse. + - Data Lineage: A lakehouse tracks data lineage, providing visibility into the origin and transformation of data within the lakehouse for better understanding and governance. 
+ - Data Governance: A lakehouse provides comprehensive data governance capabilities, including access controls, auditing, and compliance monitoring, to ensure data security and regulatory compliance. + - Scalability: A lakehouse is designed to scale horizontally, allowing for the storage and processing of large volumes of data as the organization's needs grow. + - Performance: A lakehouse is optimized for performance, providing fast query execution and low latency for data analysis and reporting, even with large volumes of data. + - Data Mesh + - Decentralized data architecture that promotes a domain-oriented approach to data management and governance, + - where data is treated as a product and owned by cross-functional teams responsible for + - its quality, availability, and usability. + - Key principles of data mesh include: + - Domain-Oriented Data Ownership: + - Data is owned and managed by cross-functional teams + - that are responsible for the quality, availability, and usability of the data + - within their domain, rather than a centralized data team. + - Data as a Product: + - Data is treated as a product, with clear ownership, defined interfaces, and + - a focus on delivering value to end-users and stakeholders. + - Self-Serve Data Infrastructure: + - Data infrastructure is designed to be self-serve, allowing teams to easily access and + - manage their data without relying on a centralized data team for support. + - Federated Computational Governance: + - Governance is distributed across the organization, with clear policies and + - guidelines for data management and usage, while still allowing for flexibility and + - autonomy for individual teams. + - Data Discoverability: + - Data is easily discoverable and accessible, + - with clear metadata and documentation to help users find and + - understand the data they need for analysis and decision-making. 
+ - Data Interoperability: + - Data is designed to be interoperable, with standardized formats and + - interfaces that allow for seamless integration and collaboration across teams and domains. + - Data Quality: + - Data quality is a shared responsibility, with teams accountable for ensuring the accuracy, + - completeness, and reliability of their data, and with processes in place for monitoring and + - improving data quality over time. + - Data Security and Privacy: + - Data security and privacy are prioritized, with clear policies and controls in place to + - protect sensitive data and ensure compliance with regulatory requirements, + - while still allowing for the necessary access and collaboration across teams and domains. + - Data Scalability: + - Data architecture is designed to scale horizontally, allowing for the storage and + - processing of large volumes of data as the organization's needs grow, + - while still maintaining performance and usability for end-users and stakeholders. + - Data Evolution: + - Data architecture is designed to evolve over time, + - allowing for the addition of new data sources, + - changes in data structure, and + - updates to data governance policies as the organization's needs and priorities change. + - Organization Data Mesh structure + - Shared Services + - Identity + - Networking + - Monitoring + - Security + - HR Data Products + - Finance Data Products + - Operations Data Products + - Sales Data Products + - Security Components + - Network Security: Measures to protect the network infrastructure, such as firewalls, intrusion detection systems, and virtual private networks (VPNs). + - Endpoint Security: Measures to protect individual devices, such as antivirus software, endpoint detection and response (EDR) solutions, and mobile device management (MDM) tools. + - Application Security: Measures to protect applications, such as secure coding practices, application firewalls, and vulnerability assessments. 
+ - Data Security: Measures to protect data, such as encryption, access controls, and data loss prevention (DLP) solutions. + - Identity and Access Management (IAM): Measures to manage user identities and control access to resources, such as multi-factor authentication (MFA), single sign-on (SSO), and role-based access control (RBAC). + - Security Information and Event Management (SIEM): Measures to monitor and analyze security events, such as log management, threat intelligence, and incident response capabilities. + - Cloud Security: Measures to protect cloud-based resources, such as cloud access security brokers (CASBs), cloud workload protection platforms (CWPPs), and cloud security posture management (CSPM) tools. + - Physical Security: Measures to protect physical assets, such as access controls, surveillance systems, and security personnel. + - Security Awareness Training: Measures to educate employees about security best practices, such as phishing awareness, password hygiene, and social engineering prevention. + - Incident Response: Measures to respond to security incidents, such as incident response plans, incident response teams, and incident response tools. + - Compliance and Regulatory Requirements: Measures to ensure compliance with relevant laws and regulations, such as GDPR, HIPAA, PCI DSS, and other industry-specific standards. + - Security Auditing and Testing: Measures to assess the effectiveness of security controls, such as penetration testing, vulnerability assessments, and security audits. + - Security Governance: Measures to establish and enforce security policies, procedures, and frameworks, such as ISO 27001, NIST Cybersecurity Framework, and CIS Controls. + - Security Metrics and Reporting: Measures to track and report on security performance, such as key performance indicators (KPIs), security dashboards, and regular security reporting to stakeholders. 
+ - Security Culture: Measures to foster a culture of security awareness and responsibility within the organization, such as leadership support, employee engagement, and recognition programs for security best practices. + - Security Incident Management: Measures to manage and mitigate security incidents, such as incident response plans, incident response teams, and incident response tools. + - Security Automation and Orchestration: Measures to automate security processes and responses, such as security orchestration, automation, and response (SOAR) platforms, and security automation tools for tasks like threat detection, incident response, and vulnerability management. + - Access controls + - use of identity providers + - Active Directory (AD) + - LDAP (Lightweight Directory Access Protocol) + - SAML (Security Assertion Markup Language) + - OAuth (Open Authorization) + - OpenID Connect (OIDC) + - RADIUS (Remote Authentication Dial-In User Service) + - Group level access controls + - User Groups + - Role-Based Access Control (RBAC) + - Attribute-Based Access Control (ABAC) + - Discretionary Access Control (DAC) + - Mandatory Access Control (MAC) + - Access Control Lists (ACLs) + - User level access controls + - Multi-Factor Authentication (MFA) + - Single Sign-On (SSO) + - Privileged Access Management (PAM) + - Role-Based Access Control (RBAC) + - Advanced Access Control Models + - based on + - geography + - time etc. 
+ - Attribute-Based Access Control (ABAC) + - Discretionary Access Control (DAC) + - Mandatory Access Control (MAC) + - Access Control Lists (ACLs) + - Multi-Factor Authentication (MFA) + - Single Sign-On (SSO) + - Privileged Access Management (PAM) + - Identity and Access Management (IAM) Solutions + - Okta + - Microsoft Azure Active Directory (Azure AD) + - Ping Identity + - Auth0 + - OneLogin + - Centrify + - CyberArk + - SailPoint + - RSA SecurID + - Duo Security + - LastPass + - 1Password + - Bitwarden + - Keeper Security + - Thycotic + - BeyondTrust + - HashiCorp Vault + - AWS Identity and Access Management (IAM) + - Google Cloud Identity and Access Management (IAM) + - Data Encryption + - Symmetric Encryption + - AES (Advanced Encryption Standard) + - DES (Data Encryption Standard; legacy, insecure) + - 3DES (Triple Data Encryption Standard; deprecated) + - RC4 (Rivest Cipher 4; insecure) + - Blowfish + - Twofish + - Asymmetric Encryption + - RSA (Rivest-Shamir-Adleman) + - ECC (Elliptic Curve Cryptography) + - DSA (Digital Signature Algorithm) + - Diffie-Hellman Key Exchange + - Hashing Algorithms + - SHA-256 (Secure Hash Algorithm 256-bit) + - SHA-1 (Secure Hash Algorithm 1; broken, avoid for security uses) + - MD5 (Message Digest Algorithm 5; broken, avoid for security uses) + - bcrypt + - scrypt + - Argon2 + - Encryption Protocols + - TLS (Transport Layer Security) + - SSL (Secure Sockets Layer; deprecated, superseded by TLS) + - IPSec (Internet Protocol Security) + - SSH (Secure Shell) + - PGP (Pretty Good Privacy) + - Key Management + - Key Management Systems (KMS) + - Hardware Security Modules (HSMs) + - Key Vaults + - Key Rotation + - Key Backup and Recovery + - Key Access Controls + - Key Usage Policies + - Key Auditing and Monitoring + - Key Lifecycle Management + - Key Distribution and Sharing + - Key Generation and Storage + - Key Revocation and Expiration + - Key Escrow and Recovery + - Key Management Best Practices + - Security Monitoring + - DataOps + - DataOps is a set of practices and tools that aim to improve the speed, quality, and reliability of data analytics and data
science projects by applying principles from DevOps to the data lifecycle. + - DataOps focuses on collaboration, automation, and continuous improvement across the entire data pipeline, from data ingestion and processing to analysis and visualization. + - Key principles of DataOps include: + - Collaboration: DataOps emphasizes collaboration between data engineers, data scientists, analysts, and other stakeholders to ensure that everyone is aligned on project goals, requirements, and timelines. + - Automation: DataOps promotes the use of automation tools and processes to streamline data workflows, reduce manual errors, and increase efficiency in data processing and analysis. + - Continuous Integration and Continuous Deployment (CI/CD): DataOps encourages the use of CI/CD practices to enable rapid development, testing, and deployment of data pipelines and analytics solutions. + - Monitoring and Feedback: DataOps emphasizes the importance of monitoring data pipelines and analytics solutions in real-time to identify issues early and provide feedback for continuous improvement. + - Version Control: DataOps promotes the use of version control systems for managing code, configurations, and documentation related to data projects to ensure reproducibility and traceability. + - Data Quality Management: DataOps emphasizes the importance of maintaining high data quality through automated testing, validation, and monitoring processes throughout the data lifecycle. + - Scalability: DataOps encourages the design of scalable data architectures and workflows that can handle increasing volumes of data as projects grow. + - Security and Compliance: DataOps promotes the implementation of security best practices and compliance measures to protect sensitive data throughout the data lifecycle. 
+ - Data Governance: DataOps emphasizes the importance of data governance practices to ensure that data is properly managed, documented, and accessible to the right stakeholders while maintaining data privacy and security. + - Continuous Learning and Improvement: DataOps encourages a culture of continuous learning and improvement, where teams regularly review their processes, tools, and outcomes to identify areas for enhancement and optimize their data workflows over time. + - DataOps Practices: + - Automation + - Testing and Validation + - Version Control + - Monitoring and Feedback + - CI/CD + - Change and Release Management + - Change Management: The process of managing changes to data systems, processes, and workflows to ensure that changes are implemented smoothly and with minimal disruption to business operations. + - Release Management: The process of planning, scheduling, and controlling the deployment of changes to data systems, processes, and workflows to ensure that releases are delivered on time and with minimal risk to business operations. + - Key components of Change and Release Management include: + - Change Request Management: The process of submitting, reviewing, and approving change requests for data systems, processes, and workflows, including impact assessment and risk analysis. + - Change Implementation: The process of implementing approved changes to data systems, processes, and workflows, including testing, validation, and documentation of changes. + - Release Planning: The process of planning and scheduling releases of changes to data systems, processes, and workflows, including coordination with stakeholders and communication of release schedules. + - Release Deployment: The process of deploying approved changes to data systems, processes, and workflows, including monitoring and validation of releases to ensure successful deployment and minimal disruption to business operations. 
+ - Change and Release Documentation: The process of documenting changes and releases to data systems, processes, and workflows, including change logs, release notes, and post-implementation reviews to capture lessons learned and improve future change and release management processes. + - Change and Release Communication: The process of communicating changes and releases to stakeholders, including end-users, management, and other relevant parties, to ensure awareness and understanding of changes and releases and to manage expectations and minimize resistance to change. + - Change and Release Governance: The process of establishing governance structures and policies for managing changes and releases to data systems, processes, and workflows, including roles and responsibilities, approval processes, and compliance requirements to ensure that changes and releases are managed effectively and in accordance with organizational policies and standards. + - Change and Release Metrics: The process of tracking and analyzing metrics related to change and release management, such as change success rates, release frequency, and post-release incident rates, to evaluate the effectiveness of change and release management processes and identify areas for improvement. + - Change and Release Automation: The process of automating change and release management processes, such as change request submission, approval workflows, testing, and deployment, to increase efficiency and reduce manual errors in change and release management. 
+ - Change and Release Training: The process of providing training and education to stakeholders involved in change and release management, including change request submitters, approvers, implementers, and end-users, to ensure that they understand their roles and responsibilities in the change and release management process and are equipped with the necessary skills and knowledge to effectively manage changes and releases to data systems, processes, and workflows. + - Rollback and Recovery: The process of planning and executing rollback procedures in case of failed changes or releases, including backup and recovery strategies to minimize downtime and data loss in the event of a failed change or release. + - Types of changes: + - Data source changes + - Model or transformation changes + - System configuration changes + - Security and access control changes + - Architectural changes + - Process and workflow changes + - Tool and technology changes + - Compliance and regulatory changes + - Performance optimization changes + - Bug fixes and issue resolution changes + - User interface and experience changes + - Data quality improvement changes + - Data governance and documentation changes + - Data retention and archiving changes + - Data integration and interoperability changes + - Data visualization and reporting changes + - Data science and machine learning changes + - DataOps and automation changes + - Change and release management process changes + - Performing a release: + - Minimize impact downstream by scheduling releases during low-usage periods or implementing phased rollouts to gradually introduce changes and mitigate potential disruptions to business operations. + - Communicate release schedules and potential impacts to stakeholders in advance to manage expectations and ensure awareness of upcoming changes, allowing for proper planning and preparation. 
+ - Implement robust testing and validation processes to ensure that changes are thoroughly tested and validated before deployment, reducing the risk of issues and minimizing the likelihood of post-release incidents that could impact business operations. + - Establish clear rollback procedures and backup strategies to quickly revert changes in case of issues during or after deployment, minimizing downtime and data loss in the event of a failed release and ensuring business continuity. + - System Stability: Ensure that the release does not introduce instability to the system, which could lead to performance degradation or downtime, by thoroughly testing and validating changes before deployment and closely monitoring the system after release. + - Verify that all pipeline transformations are operational + - Post-release: Monitor releases closely after deployment to identify and address any issues or incidents that may arise, allowing for prompt resolution and minimizing the impact on business operations. + - Data Normalization + - prime attribute vs non-prime attribute + - Normalization forms (rules for databases): + - 1NF + - each column contains only atomic values + - i.e. no multi-valued attributes + - No repeating groups in an individual table + - separate table for each set of related data + - each set of related data should be identified with a key + - e.g. + - Customer Database + - Customer_info + - Sales_Data + - Exclude: + - Sales_Items + - Item_names, price, quantity from customer table + - Put sales details in sales table + - 2NF + - already in 1NF, plus the following: + - tackles partial dependencies + - no non-prime attribute relies solely on a segment of the primary key + - all non-prime attributes depend on the entire primary key rather than a part of it + - e.g.
+ - Customer Information DB + - Customer related data + - First_name, Last_name, address, city, state, zip_code + - customer_id + - 3NF + - already in 2NF, plus the following: + - removal of fields independent of the key + - non-prime attributes are not dependent on other non-prime attributes + - tackles transitive functional dependencies (indirect dependencies) + - e.g. + - Customer DB + - state or country of an individual + - can be the same for multiple individuals + - they don't rely on the key for unique identification + - don't store them in the customer table + - create separate tables for states and countries + - 4NF + - 5NF + - Advantages of Normalization + - reduces redundancies + - eliminates duplicate entries + - prevents inefficiencies and inconsistencies + - simplifies data management + - reduces complexity + - minimizes risk of error + - enhances scalability and performance + - reduces redundancy + - organizes data efficiently + - ensures databases handle increasing volumes effectively + - Data Validation Techniques + - Only numeric values accepted for phone numbers + - Check that data lies within the expected range + - confirm primary keys and foreign keys are valid + - e.g. an incorrect product id must not be allowed + - ensure data values adhere to predefined domain values + - e.g. green, blue and red options only + - Data Refining + - remove incorrect, irrelevant or incomplete elements + - remove typos + - convert data into consistent formats + - e.g. DD-MM-YYYY + - Identify and Remove duplicates + - Summarise and consolidate data + - Monthly + - Quarterly + - Expand data with additional information + - e.g.
mapping a user to an IP address for advertising + - Data Validation Tools + - Data Profiling + - Identify Missing Values + - Identify Outliers + - Identify Other Data Anomalies + - Tools + - OpenRefine + - open source + - Trifacta Wrangler + - data wrangling + - data cleansing and transformation + - Ataccama + - Atlan + - Data Quality Management + - Informatica EDQ (Enterprise Data Quality) + - IBM InfoSphere Information Governance Catalog + - ELT + - Data Visualisation tools + - Tableau + - Microsoft PowerBI +- ERPs and CRMs + - ERP + - Enterprise Resource Planning + - CRM + - Customer Relationship Management diff --git a/Entrepreneurship/readme.md b/Entrepreneurship/readme.md new file mode 100644 index 0000000..99b1b0b --- /dev/null +++ b/Entrepreneurship/readme.md @@ -0,0 +1,4 @@ +- Entrepreneurship/Business terms + - ROAS + - Return on Ad Spend + - \ No newline at end of file diff --git a/Linux/RC_Files/readme.md b/Linux/RC_Files/readme.md index 563fb1b..9f9cea1 100644 --- a/Linux/RC_Files/readme.md +++ b/Linux/RC_Files/readme.md @@ -32,5 +32,8 @@ - which provides line-editing and history capabilities for programs like Bash. - Python: pycodestyle and pylintrc are used to configure Python code style and linting settings, respectively. - MySQL: my.cnf is an RC file used to configure the MySQL database server, - - including settings related to performance, security, and replication. - + - including settings related to performance, security, and replication. +- file that is executed when a user logs in to the system. + - used to set environment variables and run commands that should be executed for all users.
+ - ~/.profile +- \ No newline at end of file diff --git a/Linux/devices_and_mounts.md b/Linux/devices_and_mounts.md new file mode 100644 index 0000000..6159d36 --- /dev/null +++ b/Linux/devices_and_mounts.md @@ -0,0 +1,5 @@ +| Directory | What It Contains | Primary Purpose | Used By | Auto-Created | Example | +|-----------|------------------------------------|-------------------------------|----------------------|-------------------------------------|-----------------------------| +| `/dev` | Device files representing hardware | Interface to hardware devices | Kernel & system | Yes | `/dev/sda`, `/dev/null` | +| `/mnt` | Empty directory for manual mounts | Temporary/manual mounting | System administrator | No | `sudo mount /dev/sdb1 /mnt` | +| `/media` | Mount points for removable media | Auto-mounting USB, CD, etc. | Desktop environment | Yes (subdirs created automatically) | `/media/atul/MyUSB` | \ No newline at end of file diff --git a/Linux/distros.md b/Linux/distros.md new file mode 100644 index 0000000..51c43d7 --- /dev/null +++ b/Linux/distros.md @@ -0,0 +1,114 @@ +- Linux Distros + - Linux distribution + - operating system made from a software collection that is based upon the Linux kernel + - available for a wide variety of systems ranging from + - embedded devices (for example, OpenWrt) and + - personal computers (for example, Ubuntu), + - to powerful supercomputers (for example, Red Hat Enterprise Linux). + - A typical Linux distribution comprises + - Linux kernel, + - GNU tools + - libraries, + - additional software, + - documentation, + - a window system (the most common being X11) + - and a desktop environment (for example GNOME or KDE Plasma). + - free and open-source software under the GNU General Public License, + - applications, + - desktop and laptop computers, + - servers, and + - embedded systems. 
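Before comparing distributions, it helps to know which one a given machine is already running. As a minimal sketch (assuming a modern, systemd-era system), the `/etc/os-release` file identifies the distro; the fallback to `uname` is for systems that lack it:

```shell
# Sketch: identify the running distribution.
# /etc/os-release is defined by systemd and shipped by most modern distros;
# falling back to the kernel name covers systems without it.
if [ -r /etc/os-release ]; then
    . /etc/os-release                      # defines NAME, VERSION_ID, ID, ...
    echo "Distro: ${NAME:-unknown} ${VERSION_ID:-}"
else
    echo "Distro: unknown (kernel: $(uname -s))"
fi
```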
+ - Some of the most popular Linux distributions include: + - Ubuntu: A user-friendly distribution based on Debian, known for its ease of use and large community support. + - Fedora: A cutting-edge distribution sponsored by Red Hat, known for its focus on innovation and the latest software. + - Debian: A stable and versatile distribution that serves as the basis for many other distributions, including Ubuntu. + - CentOS: A free and open-source distribution based on Red Hat Enterprise Linux, often used for servers and enterprise applications. + - Arch Linux: A lightweight and flexible distribution that allows users to build their system from the ground up, known for its simplicity and customization options. + - Linux Mint: A user-friendly distribution based on Ubuntu, known for its ease of use and attractive desktop environment. + - openSUSE: A community-driven distribution that offers both a stable release (Leap) and a rolling release (Tumbleweed), known for its powerful configuration tools and support for multiple desktop environments. + - Manjaro: A user-friendly distribution based on Arch Linux, known for its ease of use and pre-configured desktop environments. + - Red Hat Enterprise Linux (RHEL): A commercial distribution designed for enterprise use, known for its stability, security, and support options. + - Kali Linux: A distribution based on Debian, designed for penetration testing and security auditing, known for its extensive collection of security tools. + - Xfce: Not a distribution itself but a lightweight desktop environment that can be used with various Linux distributions, known for its speed and low resource usage, making it suitable for older hardware. + - More + - Pop!_OS: A distribution developed by System76, based on Ubuntu, designed for developers and power users, known for its focus on productivity and gaming. + - Elementary OS: A distribution based on Ubuntu, known for its elegant and minimalist design, inspired by macOS.
+ - Zorin OS: A distribution based on Ubuntu, designed to be user-friendly and familiar for users coming from Windows, with a focus on ease of use and accessibility. + - Solus: An independent distribution that focuses on providing a cohesive and user-friendly desktop experience, known for its Budgie desktop environment. + - MX Linux: A distribution based on Debian, known for its stability, performance, and user-friendly interface, often recommended for older hardware. + - Deepin: A distribution based on Debian, developed by the Chinese company Deepin Technology, known for its visually appealing desktop environment and user-friendly design. + - Peppermint OS: A distribution based on Ubuntu, designed for cloud computing and web applications, known for its lightweight and fast performance. + - Slackware: One of the oldest Linux distributions, known for its simplicity and adherence to Unix principles, often favored by experienced users who prefer a more hands-on approach to system configuration. + - Gentoo: A distribution that allows users to compile their own software from source code, known for its flexibility and performance optimization options, often used by advanced users who want complete control over their system. + - Alpine Linux: A security-oriented, lightweight distribution based on musl libc and busybox, designed for use in containers and embedded systems, known for its small size and minimal attack surface. + - Clear Linux: A distribution developed by Intel, optimized for performance and security, designed for cloud and edge computing, known for its aggressive optimization techniques and focus on performance. + - These distributions offer a wide range of features and target different user groups, from beginners to advanced users, and from general-purpose computing to specialized use cases. The Linux distribution landscape is rich and diverse, providing options for virtually every type of user and application.
Whether you're looking for a user-friendly desktop experience, a powerful server platform, or a lightweight distribution for older hardware, there's likely a Linux distribution that fits your needs. + - Overall, the choice of a Linux distribution can depend on various factors, including the user's level of experience, the intended use case, hardware compatibility, and personal preferences. With so many options available, it's important to research and try out different distributions to find the one that best suits your needs and preferences. + - NixOS: A distribution that uses the Nix package manager to provide a declarative approach to system configuration and package management, known for its reproducibility and isolation of software environments. + - EndeavourOS: A distribution based on Arch Linux, designed to be user-friendly and accessible, with a focus on providing a simple installation process and a pre-configured desktop environment. + - Void Linux: An independent distribution that uses the runit init system and the XBPS package manager. + - Parrot OS: A distribution based on Debian, designed for security and privacy, with a focus on penetration testing, digital forensics, and anonymity. + - Tails: A distribution based on Debian, designed for privacy and anonymity, with a focus on secure communication and data protection, often used by journalists, activists, and individuals in need of strong privacy measures. + - Qubes OS: A security-focused distribution that uses virtualization to isolate different applications and tasks, providing a high level of security by compartmentalizing different activities into separate virtual machines.
+ - BlackArch: A distribution based on Arch Linux, designed for penetration testing and security research, with a large collection of tools for ethical hacking and security analysis. \ No newline at end of file diff --git a/Linux/linuxFileSystem.md b/Linux/linuxFileSystem.md index f746bb9..6dfa576 100755 --- a/Linux/linuxFileSystem.md +++ b/Linux/linuxFileSystem.md @@ -18,6 +18,11 @@ - /etc/init.d - This directory contains the scripts that start various system services at boot time. - /etc/crontab - automated jobs + - /etc/services - network services and their port numbers + - e.g. http 80/tcp, https 443/tcp, ssh 22/tcp, 8080/tcp + - /etc/group - group information for users + - /etc/login.defs - password aging and other login related settings + - /etc/shadow - encrypted password information for users - /bin, /usr/bin: These two directories contain most of the programs for the system. - The /bin directory has the essential programs that the system requires to operate, - while /usr/bin contains applications for the system's users. @@ -57,7 +62,9 @@ - contains devices that are available to the system. - In Linux (like Unix), devices are treated like files. - we can read and write devices as though they were files. - - For example, /dev/fd0 is the first floppy disk drive, /dev/sda is the first hard drive.
+ - For example, + - /dev/fd0 is the first floppy disk drive, + - /dev/sda is the first hard drive. - All the devices that the kernel understands are represented here. - /dev/fd*: Floppy disk drives. - /dev/hd*: IDE (PATA) disks on older systems. Typical motherboards contain two IDE connectors or channels, @@ -68,10 +75,24 @@ - A trailing digit indicates the partition number on the device. For example, /dev/hda1 refers to the first partition - on the first hard drive on the system, while /dev/hda refers to the entire drive. - /dev/lp*: Printers. - - /dev/sd*: SCSI disks. On modern Linux systems, the kernel treats all disk like devices (including PATA/SATA hard disks, - - flash drives, and USB mass storage devices such as portable music players and digital cameras) as SCSI disks. + - /dev/sd*: SCSI disks. + - On modern Linux systems, the kernel treats all disk-like devices (including PATA/SATA hard disks, + - flash drives, and USB mass storage devices such as portable music players and digital cameras) + - as SCSI disks. - The rest of the naming system is similar to the older /dev/hd* naming scheme described above. - - /dev/sr*: Optical drives (CD/DVD readers and burners). + - /dev/sr*: + - Optical drives (CD/DVD readers and burners). + - /dev/null (black-hole file): A special device that discards all data written to it. + - It is often used to suppress output from commands or to provide an empty input source. + - /dev/tty: + - Represents terminal devices. + - Each terminal session is associated with a /dev/tty entry, + - allowing for input and output operations. + - /dev/random and /dev/urandom: + - These are special devices that provide random data. + - /dev/random provides high-quality random data + - but may block if there is not enough entropy, + - while /dev/urandom provides non-blocking random data. - /proc: The /proc directory is also special. - This directory does not contain files. - In fact, this directory does not really exist at all.
It is entirely virtual. @@ -102,3 +123,46 @@ - /opt: The /opt directory is used to install “optional” software. - This is mainly used to hold commercial software products that might be installed on the system. - Tip: Using the tail -f /var/log/messages technique is a great way to watch what the system is doing in near real-time. +- /etc/shadow: This file contains the encrypted password information for users. + - It is only readable by the superuser, and is used by the system to verify user passwords when they log in. + - The /etc/passwd file contains a placeholder for the password, but the actual encrypted password is stored in /etc/shadow for security reasons. +- /etc/login.defs: This file contains settings related to user login, such as password aging policies and other login-related configurations. + - It defines how long passwords are valid, when they expire, and other related settings that help manage user accounts and security. +- inodes: + - inode (index node) is a data structure that stores metadata about a file + - but not the file name and not the actual file content. + - stores: + - File type (regular file, directory, symlink, etc.) + - File size + - Owner (UID) + - Group (GID) + - Permissions (rwx) + - Timestamps: + - atime (last access) + - mtime (last modification) + - ctime (metadata change) + - Number of hard links + - Pointers to actual data blocks on disk + - Examples: + - ls -i file.txt + - 123456 file.txt + - 123456 is the inode number for file.txt + - df -i + - a hard link shares the same inode number across multiple file names + - ln file.txt link.txt + - Linux identifies files internally by inode number, not filename. + - Even if you rename a file, the inode number stays the same. + - Inode Exhaustion Problem + - Sometimes you may see: No space left on device + - But disk space is available! + - Reason: All inodes are used up.
+ - Common case: + - Too many small files + - /tmp filled with tiny files + - Check + - df -i + - Shows inode usage + - Solution: + - Delete unnecessary files to free inodes + - Clean up /tmp + - Use a filesystem with more inodes if needed (e.g. ext4 with higher inode ratio) diff --git a/Linux/others/distros.md b/Linux/others/linux_basics.md similarity index 69% rename from Linux/others/distros.md rename to Linux/others/linux_basics.md index 7936604..4961302 100644 --- a/Linux/others/distros.md +++ b/Linux/others/linux_basics.md @@ -1,542 +1,388 @@ - -Linux is an open-source operating system that is widely used for various purposes, from running servers to powering personal computers and even smartphones. It's known for its stability, security, and flexibility. Here's a beginner-friendly tutorial to get you started: - -1. Understanding the Basics: - What is Linux? - Linux is a Unix-like operating system kernel that was first released on September 17, 1991, by Linus Torvalds. It's the heart of what we call a Linux distribution, or simply "Linux" in common language. Popular distributions include Ubuntu, Fedora, Debian, and CentOS. - -Terminal or Command Line Interface (CLI): -In Linux, the command line is a powerful tool for interacting with the system. You'll often hear it referred to as the terminal or CLI. Don't be intimidated; it's quite user-friendly once you get the hang of it. - -File System Hierarchy: -Linux organizes its file system differently from Windows. The root directory ("/") is the top-level directory from which everything branches out. Key directories include /bin (binaries), /home (user directories), /etc (system configuration files), and /var (variable data such as logs). - -2. Getting Started: - Choosing a Distribution: - There are many distributions of Linux, each with its own characteristics and target audiences.
Ubuntu is often recommended for beginners due to its user-friendly nature and vast community support. - -Installation: -You can install Linux alongside your existing operating system (dual-boot) or replace it entirely. Most distributions offer live USB/DVD versions, allowing you to test them out before installing. - -3. Basic Commands: - Navigating the File System: - -pwd: Print the current working directory. -ls: List directory contents. -cd: Change directory. -mkdir: Make directory. -rmdir: Remove directory. -cp: Copy files or directories. -mv: Move or rename files or directories. -rm: Remove files or directories. -touch: Create an empty file. -Working with Files: - -cat: Concatenate and display file content. -less or more: View file contents one page at a time. -head and tail: Display the beginning or end of a file, respectively. -grep: Search for patterns within files. -4. User Management: - Adding Users: - -sudo adduser username: Add a new user. -sudo usermod -aG sudo username: Add the user to the sudo group for administrative privileges. -Permissions: -Linux uses a permission system to control access to files and directories. Permissions include read (r), write (w), and execute (x), and they apply to three categories: owner, group, and others. - -Changing Permissions: - -chmod: Change file permissions. -chown: Change file ownership. -5. Package Management: - Installing Software: - Package managers like apt (Ubuntu, Debian) and yum (Fedora, CentOS) make it easy to install, update, and remove software. - -sudo apt update: Update package lists. -sudo apt install packagename: Install a package. -sudo apt remove packagename: Remove a package. -6. System Maintenance: - Updating the System: - Keeping your system up-to-date is crucial for security and performance. - -sudo apt update: Update package lists. -sudo apt upgrade: Upgrade installed packages. -System Information: - -uname -a: Display system information. -df -h: Display disk space usage. 
-free -h: Display memory usage. -7. Getting Help: - Manual Pages: - Most commands come with manual pages that provide detailed information on their usage. You can access them using the man command, e.g., man ls for information about the ls command. - -Online Resources: -There's a vast amount of online documentation, tutorials, and forums where you can find answers to your questions and learn more about Linux. - -Conclusion: -This tutorial provides a solid foundation for getting started with Linux. As you become more comfortable with the basics, you can explore more advanced topics like shell scripting, networking, system administration, and more. Don't hesitate to experiment and have fun exploring the world of Linux! - -Let's continue with some more advanced topics in Linux: - -8. Shell Scripting: - What is a Shell Script? - A shell script is a text file containing a series of commands that are executed as a single script. Shell scripts can automate repetitive tasks, manage system configurations, and more. - -Creating a Shell Script: - -Create a new file with a .sh extension, e.g., myscript.sh. -Add your commands to the file, starting with #!/bin/bash (the shebang line) to specify the shell to use. -Make the script executable with chmod +x myscript.sh. -Run the script with ./myscript.sh. -Example Script: - -bash -#!/bin/bash -echo "Hello, World!" -9. Text Processing: - Using sed and awk: - -sed is a stream editor for filtering and transforming text. -awk is a powerful text processing tool for pattern scanning and processing. -Example Usage: - -bash -# Using sed to replace text -echo "Hello, World!" | sed 's/Hello/Hi/' - -# Using awk to print specific columns -echo "1 2 3" | awk '{print $2}' -10. Managing Processes: - Viewing Processes: - -ps: Display information about processes. -top: Interactive process viewer. -Killing Processes: - -kill: Send a signal to a process.
Use kill -9 PID to force kill a process. -11. Networking: - Checking Network Configuration: - -ifconfig or ip addr: Display network interface information. -ping: Check network connectivity to a host. -netstat or ss: Display network connections, routing tables, and more. -Configuring Network Settings: - -ifconfig eth0 192.168.1.10 netmask 255.255.255.0: Configure the IP address of the interface eth0. -route add default gw 192.168.1.1: Add a default gateway. -12. System Administration: - Logs and Monitoring: - -journalctl: View system logs. -dmesg: Display kernel ring buffer messages. -Managing Services: - -systemctl: Control system services. E.g., systemctl start sshd to start the SSH service. -Conclusion: -Linux offers a vast array of tools and utilities for both basic and advanced users. As you continue your Linux journey, you'll discover even more powerful features and possibilities. Experimenting with these tools in a safe environment is a great way to learn and expand your skills. Enjoy exploring the world of Linux! - -Let's delve deeper into some more advanced Linux topics: - -13. File Permissions and Ownership: - Understanding Permissions: - Permissions in Linux are represented by three sets of characters: user, group, and others. Each set specifies read (r), write (w), and execute (x) permissions. For example, drwxr-xr-x indicates a directory where the owner has read, write, and execute permissions, while others have only read and execute permissions. - -Changing Permissions: - -chmod: Change file permissions. For example, chmod 755 file.txt sets read, write, and execute permissions for the owner, and read and execute permissions for group and others. -chown: Change file ownership. For example, chown user:group file.txt changes the owner and group of the file. -14. System Backup and Restore: - Backup Tools: - -rsync: Efficient file synchronization tool. -tar: Archive files and directories. -dd: Create disk images.
-Example Usage: - -bash -# Create a tar archive of a directory -tar -cvf backup.tar /path/to/directory - -# Restore files from a tar archive -tar -xvf backup.tar -15. File System Management: - Checking and Repairing File Systems: - -fsck: File system consistency check and repair. -e2fsck: Specific to the ext2, ext3, and ext4 file systems. -Expanding File Systems: - -resize2fs: Resize ext2, ext3, or ext4 file systems. -lvextend: Extend logical volumes in LVM (Logical Volume Manager). -16. Security and Firewall: - Firewall Configuration: - -iptables: Command-line firewall utility (legacy, being replaced by nftables). -firewalld: Dynamic firewall manager for iptables. -Security Best Practices: - -Regularly update your system and software to patch security vulnerabilities. -Use strong passwords and consider using SSH keys for authentication. -Disable unused services to reduce the attack surface. -17. Disk Management: - Monitoring Disk Usage: - -df: Display disk space usage. -du: Estimate file space usage. -Mounting and Unmounting File Systems: - -mount: Mount a file system. -umount: Unmount a file system. -Conclusion: -Linux is a powerful operating system with a wide range of capabilities. As you explore these advanced topics, remember to always back up your data and proceed with caution, especially when dealing with system-critical operations. Continuous learning and practice will help you become proficient in managing and administering Linux systems. - -Let's continue with more advanced Linux topics: - -18. Shell Customization: - Customizing the Shell Prompt: - You can customize your shell prompt to display information such as the current directory, username, hostname, and more. This can be done by modifying the PS1 variable in your shell's configuration file (e.g., .bashrc for Bash).
- -Example PS1 Customization: - -bash -export PS1="\u@\h \w $ " -This will display the username (\u), hostname (\h), and current working directory (\w) in the prompt. - -19. Process Management: - Background and Foreground Processes: - -Use & at the end of a command to run it in the background, allowing you to continue using the terminal. -Use fg to bring a background process to the foreground. -Process Prioritization: - -nice: Run a command with a specified priority level. -renice: Change the priority of a running process. -20. Shell Scripting - Advanced: - Variables: - -Use variables to store values and reuse them in your scripts. Variables are case-sensitive and typically uppercase by convention. -Example Variable Usage: - -bash -NAME="John" -echo "Hello, $NAME!" -Conditional Statements: - -Use if, elif, and else statements to add conditional logic to your scripts. -Example Conditional Statement: - -bash -if [ "$NAME" == "John" ]; then -echo "Hello, John!" -else -echo "Hello, stranger!" -fi -Loops: - -Use for and while loops to iterate over a list of items or until a condition is met. -Example Loop Usage: - -bash -for i in {1..5}; do -echo "Iteration $i" -done -21. Networking - Advanced: - SSH (Secure Shell): - -Use SSH for secure remote access to a Linux system. -ssh user@hostname: Connect to a remote system. -SCP (Secure Copy): - -Use SCP to securely copy files between systems. -scp file.txt user@hostname:/path/to/destination -Conclusion: -These advanced Linux topics will help you become more proficient in managing and customizing your Linux system. As you continue to explore Linux, you'll discover even more tools and techniques to enhance your workflow and productivity. Keep experimenting and learning, and you'll soon master the intricacies of Linux administration and scripting. - -Let's delve further into advanced Linux concepts: - -22.
Package Management - Advanced: - Package Querying: - -Use package managers to query information about installed packages and repositories. -apt-cache (Ubuntu, Debian) and yum (Fedora, CentOS) can be used for this purpose. -Example Package Query: - -bash -Copy code -apt-cache show package_name -Package Dependencies: - -Package managers can automatically resolve and install dependencies for a package. -Ensure you have enabled the appropriate repositories to access required packages. -23. Kernel Management: - Kernel Modules: - -The Linux kernel can be extended using loadable kernel modules. -Use lsmod to list currently loaded modules and modprobe to load or unload modules. -Kernel Parameters: - -You can pass parameters to the kernel at boot time to customize its behavior. -These parameters can be set in the bootloader configuration (e.g., GRUB). -24. Virtualization: - Virtualization Technologies: - -Linux supports various virtualization technologies, including KVM (Kernel-based Virtual Machine), VirtualBox, and Docker. -These technologies allow you to run virtualized instances of other operating systems or applications. -Example Docker Usage: - -Install Docker using your package manager. -Pull a Docker image: docker pull image_name -Run a Docker container: docker run image_name -25. Monitoring and Performance Tuning: - System Monitoring Tools: - -Use tools like top, htop, and atop to monitor system performance, resource usage, and running processes. -Collect and analyze system logs using tools like syslog-ng or rsyslog. -Performance Tuning: - -Adjust kernel parameters, filesystem settings, and hardware configurations to optimize system performance. -Use profiling tools to identify and eliminate performance bottlenecks in applications. -26. Backup Strategies: - Backup Types: - -Implement a backup strategy that includes full backups, incremental backups, and differential backups. -Use tools like rsnapshot, rsync, or tar to automate backup tasks. 
-Offsite Backup: - -Store backups offsite to protect against data loss in the event of a disaster. -Use cloud storage services or external storage devices for offsite backups. -27. System Recovery and Rescue: - Recovery Tools: - -In case of system failure, use recovery tools such as fsck or testdisk to repair filesystems and recover data. -Create a bootable USB drive or CD/DVD with a Linux live distribution for system rescue. -Conclusion: -These advanced Linux concepts will help you become a more proficient Linux user and administrator. Remember to always backup important data before making system changes and to test changes in a safe environment. With practice and exploration, you'll continue to expand your Linux knowledge and skills. - -User -continue -ChatGPT -Let's explore a few more advanced Linux topics: - -28. RAID Configuration: - RAID Levels: - -RAID (Redundant Array of Independent Disks) allows you to combine multiple disks into a single logical unit for performance, redundancy, or both. -Common RAID levels include RAID 0, RAID 1, RAID 5, RAID 10, etc., each offering different levels of performance and redundancy. -RAID Configuration: - -Use tools like mdadm (Linux software RAID management) to configure RAID arrays. -Configure RAID arrays in your /etc/mdadm/mdadm.conf file for automatic assembly at boot. -29. Shell Scripting - Advanced Techniques: - Functions: - -Use functions to organize your code and make it more modular and reusable. -Define functions using the function keyword or simply by name. -Example Function Usage: - -bash -Copy code -greeting() { -echo "Hello, $1!" -} - -greeting "John" -Error Handling: - -Implement error handling in your scripts using trap to catch signals and set -e to exit on error. -Use || and && to execute commands based on the success or failure of previous commands. -30. Kernel Tuning: - Kernel Parameters: - -Adjust kernel parameters to optimize performance or enable/disable specific features. 
-Use the sysctl command to view and modify kernel parameters dynamically. -Example Kernel Parameter Modification: - -bash -Copy code -sysctl -w vm.swappiness=10 -31. Linux Containers: - Containerization Technologies: - -Linux containers, such as Docker, LXC (Linux Containers), and Podman, provide lightweight, isolated environments for running applications. -Containers share the host kernel but have their own filesystem and process space. -Example Docker Usage: - -Pull a Docker image: docker pull image_name -Run a Docker container: docker run image_name -32. High Availability and Clustering: - High Availability (HA): - -Implement HA solutions to ensure continuous operation of critical services. -Use technologies like Pacemaker, Corosync, and Keepalived for HA clustering. -Clustered File Systems: - -Clustered file systems like GFS2 (Global File System 2) and OCFS2 (Oracle Cluster File System 2) allow multiple nodes to access shared storage. -Conclusion: -These advanced Linux topics provide a deeper understanding of Linux systems administration and can help you manage and optimize Linux systems more effectively. Experimenting with these concepts in a controlled environment will enhance your skills and confidence in working with Linux. - -User -continue -ChatGPT -Of course! Let's explore a few more advanced Linux topics: - -33. Container Orchestration: - Orchestration Platforms: - -Orchestration platforms like Kubernetes and Docker Swarm help manage and scale containerized applications. -They automate deployment, scaling, and operations of application containers across clusters of hosts. -Kubernetes: - -Kubernetes is an open-source platform for automating deployment, scaling, and managing containerized applications. -It provides features for container orchestration, including load balancing, service discovery, and automated rollouts and rollbacks. -Docker Swarm: - -Docker Swarm is Docker's native clustering and orchestration tool. 
-It allows you to create a cluster of Docker hosts and deploy services to the cluster. -34. Automated Configuration Management: - Configuration Management Tools: - -Tools like Ansible, Chef, and Puppet help automate the configuration and management of systems. -They enable you to define the desired state of your infrastructure and automatically enforce that state. -Ansible: - -Ansible is a simple, agentless automation tool that uses SSH to manage systems. -It allows you to automate tasks such as configuration management, application deployment, and orchestration. -Chef: - -Chef is a powerful automation platform that uses a declarative approach to infrastructure configuration. -It allows you to define your infrastructure as code and manage it using Chef's recipes and cookbooks. -35. Monitoring and Logging: - Monitoring Tools: - -Use monitoring tools like Prometheus, Grafana, and Nagios to monitor the health and performance of your systems. -These tools provide metrics, alerts, and dashboards to help you keep an eye on your infrastructure. -Logging: - -Centralized logging solutions like ELK stack (Elasticsearch, Logstash, Kibana) and Fluentd allow you to collect, store, and analyze logs from various sources. -They help you troubleshoot issues, track trends, and ensure compliance. -36. Security Best Practices: - Security Hardening: - -Follow best practices for securing your Linux systems, such as keeping software up to date, using strong passwords, and disabling unused services. -Implement firewalls (e.g., iptables, firewalld) and intrusion detection/prevention systems (e.g., Snort, Suricata) to protect your systems. -Security Auditing: - -Regularly audit your systems for vulnerabilities and compliance with security policies. -Use tools like OpenVAS and Lynis to scan for vulnerabilities and audit system configurations. -Conclusion: -These advanced Linux topics cover a range of areas, from containerization and orchestration to automation, monitoring, and security. 
By mastering these concepts, you'll be able to manage complex Linux environments more efficiently and securely. Experimenting with these tools and technologies in a lab environment will help you gain practical experience and enhance your skills as a Linux administrator. - -User -continue -ChatGPT -Certainly! Let's explore a few more advanced Linux topics: - -37. Networking - Advanced: - Network Bonding (Teaming): - -Network bonding (also known as NIC teaming or link aggregation) combines multiple network interfaces into a single logical interface for increased bandwidth and redundancy. -Use tools like nmcli or ip to configure network bonding. -Example Network Bonding Configuration: - -bash -Copy code -# Create a bonding interface -nmcli connection add type bond ifname bond0 mode 802.3ad -# Add slave interfaces to the bond -nmcli connection add type ethernet ifname eth0 master bond0 -nmcli connection add type ethernet ifname eth1 master bond0 -Virtual LANs (VLANs): - -VLANs allow you to segment a physical network into multiple logical networks. -Use tools like ip or vlan to configure VLANs. -Example VLAN Configuration: - -bash -Copy code -# Create a VLAN interface -ip link add link eth0 name eth0.10 type vlan id 10 -# Assign an IP address to the VLAN interface -ip addr add 192.168.1.1/24 dev eth0.10 -# Bring up the VLAN interface -ip link set dev eth0.10 up -38. Advanced File System Features: - Logical Volume Management (LVM): - -LVM allows you to manage storage volumes dynamically, including resizing, snapshotting, and moving volumes across physical disks. -Use tools like lvcreate, lvextend, and lvresize to manage logical volumes. -Example LVM Usage: - -bash -Copy code -# Create a new logical volume -lvcreate -L 1G -n mylv myvg -# Extend a logical volume -lvextend -L +500M /dev/myvg/mylv -File System Encryption: - -Encrypt file systems to protect data at rest. -Use tools like cryptsetup to set up encrypted volumes. 
-Example File System Encryption: - -bash -Copy code -# Create an encrypted volume -cryptsetup luksFormat /dev/sdb1 -# Open the encrypted volume -cryptsetup luksOpen /dev/sdb1 my_encrypted_volume -# Create a file system on the encrypted volume -mkfs.ext4 /dev/mapper/my_encrypted_volume -39. Advanced Package Management: - Package Building: - -Create your own packages for distribution using tools like dpkg-deb or rpmbuild. -Use package managers to install, update, and remove packages on your system. -Example Package Building (Debian): - -bash -Copy code -# Create a Debian package -dpkg-deb --build mypackage -# Install the package -dpkg -i mypackage.deb -Conclusion: -These advanced Linux topics cover networking, storage, package management, and security aspects of Linux system administration. Mastering these concepts will enable you to manage and optimize Linux systems in a variety of environments. Experimenting with these features in a lab environment will help you gain practical experience and enhance your skills as a Linux administrator. - -User -continue -ChatGPT -Absolutely, let's continue with more advanced Linux topics: - -40. Kernel Module Development: - Kernel Modules: - -Kernel modules are pieces of code that can be dynamically loaded and unloaded into the Linux kernel. -They can add new functionality to the kernel, such as device drivers or filesystem support. +- What is Linux? + - Linux is a Unix-like operating system kernel + - first released on September 17, 1991, by Linus Torvalds. + - It's the heart of what we call a Linux distribution, or simply "Linux" in common language. + - Popular distributions include Ubuntu, Fedora, Debian, and CentOS. +- Terminal or Command Line Interface (CLI): + - powerful tool for interacting with the system. + - referred to as the terminal or CLI. +- File System Hierarchy: + - Root directory ("/") is the top-level directory from which everything branches out. 
+ - Key directories include + - /bin (binaries), + - /home (user directories), + - /etc (system configuration files), and + - /var (variable data such as logs). +- Choosing a Distribution: + - Ubuntu is often recommended for beginners due to its user-friendly nature and vast community support. + +- Basic Commands: + - Navigating the File System: + - pwd: Print the current working directory. + - ls: List directory contents. + - cd: Change directory. + - mkdir: Make directory. + - rmdir: Remove directory. + - cp: Copy files or directories. + - mv: Move or rename files or directories. + - rm: Remove files or directories. + - touch: Create an empty file. + - Working with Files: + - cat: Concatenate and display file content. + - less or more: View file contents one page at a time. + - head and tail: Display the beginning or end of a file, respectively. + - grep: Search for patterns within files. +- User Management: + - Adding Users: + - sudo adduser username: Add a new user. + - sudo usermod -aG sudo username: Add the user to the sudo group for administrative privileges. +- Permissions: + - read (r), + - write (w), and + - execute (x), + - three categories: owner, group, and others. +- Changing Permissions: + - chmod: Change file permissions. + - chown: Change file ownership. +- Package Management: + - Installing Software: + - Package managers like + - apt (Ubuntu, Debian) and + - yum (Fedora, CentOS) + - to install, update, and remove software. + - sudo apt update: Update package lists. + - sudo apt install packagename: Install a package. + - sudo apt remove packagename: Remove a package. +- System Maintenance: + - Updating the System: + - sudo apt update: Update package lists. + - sudo apt upgrade: Upgrade installed packages. +- System Information: + - uname -a: Display system information. + - df -h: Display disk space usage. + - free -h: Display memory usage. +- Getting Help: + - man command, e.g., man ls for information about the ls command. 
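The navigation and file-management commands above can be strung together into a short, hypothetical session; all directory and file names here are scratch names created in a throwaway location:

```bash
# Hypothetical session exercising pwd, mkdir, touch, cp, mv, ls, and rm
# inside a fresh temporary directory, so nothing real is modified.
cd "$(mktemp -d)"                      # work in a scratch directory
mkdir -p projects/notes                # mkdir: create (nested) directories
touch projects/notes/todo.txt          # touch: create an empty file
cp projects/notes/todo.txt todo.bak    # cp: copy a file
mv todo.bak archive.txt                # mv: move/rename a file
ls projects/notes                      # ls: list directory contents
pwd                                    # pwd: print the working directory
rm archive.txt                         # rm: remove a file
```

Running each line interactively and re-checking with `ls` after every step is a good way to build intuition for these commands.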
+- Shell Script:
+  - .sh extension, e.g., myscript.sh.
+  - starting with #!/bin/bash (the shebang line) to specify the shell to use.
+  - executable with chmod +x myscript.sh.
+  - Run the script with ./myscript.sh.
+  - Example Script:
+    ```
+    #!/bin/bash
+    echo "Hello, World!"
+    ```
+- Text Processing:
+  - Using sed and awk:
+    - sed is a stream editor for filtering and transforming text.
+    - awk is a powerful text processing tool for pattern scanning and processing.
+  - Example Usage:
+    ```
+    # Using sed to replace text
+    echo "Hello, World!" | sed 's/Hello/Hi/'
+    # Using awk to print specific columns
+    echo "1 2 3" | awk '{print $2}'
+    ```
+- Managing Processes:
+  - ps: Display information about processes.
+  - top: Interactive process viewer.
+  - kill: Send a signal to a process. Use kill -9 PID to force kill a process.
+- Networking:
+  - ifconfig or ip addr: Display network interface information.
+  - ping: Check network connectivity to a host.
+  - netstat or ss: Display network connections, routing tables, and more.
+  - ifconfig eth0 192.168.1.10 netmask 255.255.255.0: Configure the IP address of the interface eth0.
+  - route add default gw 192.168.1.1: Add a default gateway.
+- System Administration:
+  - journalctl: View system logs.
+  - dmesg: Display kernel ring buffer messages.
+  - Managing Services:
+    - systemctl: Control system services. E.g., systemctl start sshd to start the SSH service.
+- File Permissions and Ownership:
+  - drwxr-xr-x indicates a directory where the
+    - owner has read, write, and execute permissions,
+    - while group and others have only read and execute permissions.
+  - Changing Permissions:
+    - chmod: Change file permissions.
+      - example, chmod 755 file.txt sets read, write, and execute permissions for the owner, and
+      - read and execute permissions for group and others.
+    - chown: Change file ownership.
+      - example, chown user:group file.txt changes the owner and group of the file.
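The sed and awk one-liners above generalize to small pipelines. Here is a sketch over made-up sample data (the file name and its contents are hypothetical):

```bash
# Sample data (hypothetical): one "name score" pair per line.
printf 'alice 42\nbob 7\ncarol 99\n' > scores.txt

grep -c ' ' scores.txt                       # grep: count matching lines → 3
sed 's/alice/ALICE/' scores.txt              # sed: substitute text in a stream
awk '{ total += $2 } END { print total }' scores.txt  # awk: sum column 2 → 148
```

sed works line by line on a stream, while awk splits each line into fields (`$1`, `$2`, …), which is why it is the natural choice for column-oriented work.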
+- System Backup and Restore: + - Backup Tools: + - rsync: Efficient file synchronization tool. + - tar: Archive files and directories. + - dd: Create disk images. + - Example Usage: + - # Create a tar archive of a directory + - tar -cvf backup.tar /path/to/directory + - # Restore files from a tar archive + - tar -xvf backup.tar + - File System Management: + - fsck: File system consistency check and repair. + - e2fsck: Specific to the ext2, ext3, and ext4 file systems. + - Expanding File Systems: + - resize2fs: Resize ext2, ext3, or ext4 file systems. + - lvextend: Extend logical volumes in LVM (Logical Volume Manager). +- Security and Firewall: + - Firewall Configuration: + - iptables: Command-line firewall utility (legacy, being replaced by nftables). + - firewalld: Dynamic firewall manager for iptables. +- Disk Management: + - Monitoring Disk Usage: + - df: Display disk space usage. + - du: Estimate file space usage. + - Mounting and Unmounting File Systems: + - mount: Mount a file system. + - umount: Unmount a file system. +- Shell Customization: + - modifying the PS1 variable in your shell's configuration file (e.g., .bashrc for Bash). + - Example PS1 Customization: + - export PS1="\u@\h \w $ " + - This will display the username (\u), hostname (\h), and current working directory (\w) in the prompt. +- Process Management: + - Background and Foreground Processes: + - Use & at the end of a command to run it in the background, allowing you to continue using the terminal. + - Use fg to bring a background process to the foreground. +- Process Prioritization: + - nice: Run a command with a specified priority level. + - renice: Change the priority of a running process. +- Shell Scripting - Advanced: + - Variables: + - NAME="John" + - echo "Hello, $NAME!" + - Conditional Statements: + - if, elif, and else statements to add conditional logic to your scripts. + - if [ "$NAME" == "John" ]; then + - echo "Hello, John!" + - else + - echo "Hello, stranger!" 
+  - fi
+  - Loops:
+    - for and while loops
+    - Example Loop Usage:
+      ```
+      for i in {1..5}; do
+        echo "Iteration $i"
+      done
+      ```
+- Networking - Advanced:
+  - SSH (Secure Shell):
+    - secure remote access to a Linux system.
+    - ssh user@hostname: Connect to a remote system.
+  - SCP (Secure Copy):
+    - securely copy files between systems.
+    - scp file.txt user@hostname:/path/to/destination
+- Package Management - Advanced:
+  - Package Querying:
+    - apt-cache (Ubuntu, Debian) and
+    - yum (Fedora, CentOS) can be used for this purpose.
+    - Example Package Query:
+      - apt-cache show package_name
+- Kernel Management:
+  - Kernel Modules:
+    - The Linux kernel can be extended using loadable kernel modules.
+    - lsmod: list currently loaded modules.
+    - modprobe: load or unload modules.
+  - Kernel Parameters:
+    - pass parameters to the kernel at boot time to customize its behavior.
+    - These parameters can be set in the bootloader configuration (e.g., GRUB).
+- Virtualization:
+  - Virtualization Technologies:
+    - KVM (Kernel-based Virtual Machine),
+    - VirtualBox,
+    - Docker.
+  - Pull a Docker image:
+    - docker pull image_name
+  - Run a Docker container:
+    - docker run image_name
+- Monitoring and Performance Tuning:
+  - System Monitoring Tools:
+    - top,
+    - htop,
+    - atop
+  - Collect and analyze system logs
+    - syslog-ng
+    - rsyslog
+  - Performance Tuning:
+    - kernel parameters, filesystem settings, and hardware configurations to optimize system performance.
+    - profiling tools to identify and eliminate performance bottlenecks in applications.
+- Backup Strategies:
+  - Backup Types:
+    - full backups,
+    - incremental backups,
+    - differential backups.
+  - tools
+    - rsnapshot,
+    - rsync,
+    - tar
+- System Recovery and Rescue:
+  - Recovery Tools (to repair filesystems and recover data):
+    - fsck
+    - testdisk
+- RAID Configuration:
+  - RAID (Redundant Array of Independent Disks)
+    - combine multiple disks into a single logical unit for performance, redundancy, or both.
+  - RAID levels
+    - RAID 0,
+    - RAID 1,
+    - RAID 5,
+    - RAID 10, etc.,
+    - each offering different levels of performance and redundancy.
+  - RAID Configuration tools:
+    - mdadm (Linux software RAID management)
+      - to configure RAID arrays.
+  - Configure RAID arrays:
+    - /etc/mdadm/mdadm.conf file for automatic assembly at boot.
+- Shell Scripting - Advanced Techniques:
+  - Functions:
+    - Example Function Usage:
+      ```
+      greeting() {
+        echo "Hello, $1!"
+      }
+      greeting "John"
+      ```
+- Error Handling:
+  - set -e to exit on error.
+  - Use || and && to execute commands based on the success or failure of previous commands.
+- Kernel Tuning:
+  - Kernel Parameters:
+    - sysctl command
+      - view and modify kernel parameters dynamically.
+    - Example Kernel Parameter Modification:
+      - sysctl -w vm.swappiness=10
+- Linux Containers:
+  - Containerization Technologies:
+    - Docker,
+    - LXC (Linux Containers),
+    - Podman
+  - Containers share the host kernel but have their own filesystem and process space.
+- High Availability and Clustering:
+  - High Availability (HA) technologies
+    - Pacemaker,
+    - Corosync,
+    - Keepalived
+  - Clustered File Systems: allow multiple nodes to access shared storage.
+    - GFS2 (Global File System 2)
+    - OCFS2 (Oracle Cluster File System 2)
+- Container Orchestration:
+  - Orchestration Platforms:
+    - help manage and scale containerized applications.
+    - automate deployment, scaling, and operations of application containers across clusters of hosts.
+  - Examples of orchestration platforms:
+    - Kubernetes
+      - open-source platform
+      - automating deployment,
+      - scaling,
+      - managing containerized applications.
+      - load balancing,
+      - service discovery,
+      - automated rollouts and rollbacks.
+    - Docker Swarm
+      - Docker's native clustering and orchestration tool.
+      - create a cluster of Docker hosts
+      - deploy services to the cluster.
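The error-handling idioms listed above (set -e, chaining with && and ||) can be sketched in a few lines; the `trap` cleanup and the scratch file name are additions for illustration, not from the notes:

```bash
set -e                                  # exit immediately if a command fails
workfile=$(mktemp)                      # hypothetical scratch file
trap 'rm -f "$workfile"' EXIT           # trap (illustrative): clean up on exit

echo "hello" > "$workfile"
grep -q hello  "$workfile" && echo "pattern found"    # &&: runs only on success
grep -q absent "$workfile" || echo "pattern missing"  # ||: runs only on failure
```

Note that under `set -e` a failing command normally aborts the script, but a command guarded by `||` (or tested in an `if`) does not, which is why the second grep is safe here.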
+- Automated Configuration Management:
+  - Configuration Management Tools: automate the configuration and management of systems
+    - Ansible
+      - simple,
+      - agentless automation tool
+      - uses SSH to manage systems.
+      - configuration management,
+      - application deployment,
+      - orchestration.
+    - Chef
+      - declarative approach to infrastructure configuration.
+      - define infrastructure as code
+      - manage it using Chef's recipes and cookbooks.
+    - Puppet
+- Monitoring and Logging:
+  - Monitoring Tools:
+    - monitor the health and performance of your systems.
+    - provide metrics, alerts, and dashboards
+    - Tools include:
+      - Prometheus,
+      - Grafana,
+      - Nagios
+  - Logging:
+    - collect, store, and analyze logs from various sources.
+    - troubleshoot issues, track trends, and ensure compliance.
+    - Centralized logging solutions
+      - ELK stack (Elasticsearch, Logstash, Kibana)
+      - Fluentd
+- Security Best Practices:
+  - Security Hardening:
+    - Implement firewalls
+      - iptables
+      - firewalld
+    - intrusion detection/prevention systems
+      - Snort,
+      - Suricata
+  - Security Auditing tools: scan for vulnerabilities and audit system configurations.
+    - OpenVAS
+    - Lynis
+- Networking - Advanced:
+  - Network Bonding (Teaming):
+    - NIC teaming or link aggregation
+    - combines multiple network interfaces into a single logical interface
+    - for increased bandwidth and redundancy.
+    - tools to configure network bonding:
+      - nmcli
+      - ip
+    - Example Network Bonding Configuration:
+      ```
+      # Create a bonding interface
+      nmcli connection add type bond ifname bond0 mode 802.3ad
+      # Add slave interfaces to the bond
+      nmcli connection add type ethernet ifname eth0 master bond0
+      nmcli connection add type ethernet ifname eth1 master bond0
+      ```
+- Virtual LANs (VLANs):
+  - segment a physical network into multiple logical networks.
+  - tools to configure VLANs:
+    - ip
+    - vlan
+  - Example VLAN Configuration:
+    ```
+    # Create a VLAN interface
+    ip link add link eth0 name eth0.10 type vlan id 10
+    # Assign an IP address to the VLAN interface
+    ip addr add 192.168.1.1/24 dev eth0.10
+    # Bring up the VLAN interface
+    ip link set dev eth0.10 up
+    ```
+- Advanced File System Features:
+  - Logical Volume Management (LVM):
+    - manage storage volumes dynamically,
+      - including resizing,
+      - snapshotting, and
+      - moving volumes across physical disks.
+    - tools to manage logical volumes:
+      - lvcreate,
+      - lvextend,
+      - lvresize
+    - Example LVM Usage:
+      ```
+      # Create a new logical volume
+      lvcreate -L 1G -n mylv myvg
+      # Extend a logical volume
+      lvextend -L +500M /dev/myvg/mylv
+      ```
+- File System Encryption:
+  - tools to set up encrypted volumes:
+    - cryptsetup
+  - Example File System Encryption:
+    ```
+    # Create an encrypted volume
+    cryptsetup luksFormat /dev/sdb1
+    # Open the encrypted volume
+    cryptsetup luksOpen /dev/sdb1 my_encrypted_volume
+    # Create a file system on the encrypted volume
+    mkfs.ext4 /dev/mapper/my_encrypted_volume
+    ```
+- Advanced Package Management:
+  - Package Building:
+    - Create your own packages for distribution using tools like dpkg-deb or rpmbuild.
+    - Use package managers to install, update, and remove packages on your system.
+  - Example Package Building (Debian):
+    ```
+    # Create a Debian package
+    dpkg-deb --build mypackage
+    # Install the package
+    dpkg -i mypackage.deb
+    ```
+- Kernel Module Development:
+  - Kernel Modules:
+    - pieces of code that can be dynamically loaded and unloaded into the Linux kernel.
+    - add new functionality to the kernel, such as device drivers or filesystem support.
+  - Module Development Steps:
+    - Install kernel headers and development tools.
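The dpkg-deb example above assumes a package directory already exists. A minimal, hypothetical skeleton looks like this (all names are made up, and the build step is guarded so it only runs on systems that actually have dpkg-deb):

```bash
# Minimal Debian package layout (all names hypothetical).
mkdir -p mypackage/DEBIAN mypackage/usr/local/bin

# Required control file with the basic metadata fields.
cat > mypackage/DEBIAN/control <<'EOF'
Package: mypackage
Version: 1.0
Architecture: all
Maintainer: Example Maintainer <maintainer@example.com>
Description: Hypothetical demo package
EOF

# A trivial payload: one executable script installed to /usr/local/bin.
printf '#!/bin/sh\necho hello from mypackage\n' > mypackage/usr/local/bin/hello
chmod 755 mypackage/usr/local/bin/hello

# Build only where dpkg-deb is installed (Debian/Ubuntu).
if command -v dpkg-deb >/dev/null 2>&1; then
  dpkg-deb --build mypackage            # produces mypackage.deb
fi
```

Everything under `mypackage/` except the `DEBIAN/` directory mirrors the filesystem layout the package will install; `DEBIAN/control` holds the metadata dpkg needs.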
diff --git a/Linux/tar_basics.md b/Linux/tar_basics.md new file mode 100644 index 0000000..7933710 --- /dev/null +++ b/Linux/tar_basics.md @@ -0,0 +1,31 @@ +- tar command + - tar -cf archive.tar file1.txt + - c: create a new archive + - f: specifies the name of the archive file + - tar -xf archive.tar + - x: extract the contents of the archive + - f: specifies the name of the archive file + - tar -cvf archive.tar file1 file2 + - c: create a new archive + - v: verbose mode, shows the files being archived + - f: specifies the name of the archive file + - tar -xvf archive.tar + - x: extract the contents of the archive + - v: verbose mode, shows the files being extracted + - f: specifies the name of the archive file + - tar -tvf archive.tar + - t: list the contents of the archive + - v: verbose mode, shows the files in the archive + - f: specifies the name of the archive file + - tar -cvzf archive.tar.gz file1 file2 + - z: compress the archive using gzip + - tar -xvzf archive.tar.gz + - z: decompress the archive using gzip + - tar -cvjf archive.tar.bz2 file1 file2 + - j: compress the archive using bzip2 + - tar -xvjf archive.tar.bz2 + - j: decompress the archive using bzip2 + - tar -cvJf archive.tar.xz file1 file2 + - J: compress the archive using xz + - tar -xvJf archive.tar.xz + - J: decompress the archive using xz \ No newline at end of file diff --git a/Linux/user_accounts_and_paswords.md b/Linux/user_accounts_and_paswords.md new file mode 100644 index 0000000..dc39522 --- /dev/null +++ b/Linux/user_accounts_and_paswords.md @@ -0,0 +1,114 @@ +# Linux Account Files – Compare & Contrast (Ubuntu/RHEL) + +Here’s a clean **compare & contrast table** of the main Linux account-related files (Ubuntu/RHEL applicable). + +--- + +## 🔐 Core Authentication & Account Files + +| File | Purpose | Stores Password? | Who Can Read | Who Can Modify | Used During Login? | Used During User Creation? 
| +|------------------------|-----------------------------------------------------|-----------------------------------|--------------|----------------|-------------------------------------|----------------------------| +| `/etc/passwd` | User account metadata | ❌ No (only `x` placeholder) | Everyone | root | ✅ Yes (username lookup, UID, shell) | ✅ Yes (entry created) | +| `/etc/shadow` | Encrypted password hashes + aging info | ✅ Yes | root only | root | ✅ Yes (password verification) | ✅ Yes (entry created) | +| `/etc/login.defs` | Default account creation & password policy settings | ❌ No | Everyone | root | ❌ No (indirect only) | ✅ Yes (read by `useradd`) | +| `/etc/default/useradd` | Default values for `useradd` | ❌ No | Everyone | root | ❌ No | ✅ Yes | +| `/etc/group` | Group definitions | ❌ No (group password rarely used) | Everyone | root | ✅ Yes (group membership) | ✅ Yes | +| `/etc/gshadow` | Secure group password file | ✅ Yes (group password) | root only | root | ⚠️ Rarely | ✅ Yes | +| `/etc/pam.d/*` | Authentication policy control | ❌ No (controls behavior) | Everyone | root | ✅ YES (very important) | ⚠️ Indirect | + +--- + +## 🧠 Field-Level Comparison + +| Aspect | `/etc/passwd` | `/etc/shadow` | `/etc/login.defs` | +|----------------------|----------------------------------|----------------------------------|-----------------------| +| Format | Colon-separated | Colon-separated | Key-value pairs | +| Example Entry | `atul:x:1000:1000:...:/bin/bash` | `atul:$y$...:19700:0:99999:7:::` | `PASS_MAX_DAYS 99999` | +| Password Location | Placeholder only | Actual hash | Not stored | +| Password Expiry Info | ❌ No | ✅ Yes | Default values only | +| Security Sensitivity | Medium | High | Low | + +--- + +## 🔑 Password-Related Comparison + +| Feature | `/etc/passwd` | `/etc/shadow` | PAM (`/etc/pam.d/`) | +|---------------------------|---------------|----------------|---------------------| +| Password hash stored | ❌ | ✅ | ❌ | +| Password expiry days | ❌ | ✅ | May 
enforce | +| Password complexity | ❌ | ❌ | ✅ | +| Account locking | ❌ | ✅ (`!` or `*`) | ✅ | +| Hashing algorithm control | ❌ | ❌ | ✅ | + +--- + +## 🐧 Ubuntu vs RHEL Behavior (Important Difference) + +| Feature | Ubuntu | RHEL/CentOS | +|------------------------|-----------------------|--------------------| +| Default hash algorithm | yescrypt (`$y$`) | SHA-512 (`$6$`) | +| Main policy control | PAM dominant | `login.defs` + PAM | +| SELinux | ❌ Disabled by default | ✅ Enabled | +| AppArmor | ✅ Yes | ❌ No (default) | + +--- + +## 🔄 When Each File Is Used + +| Operation | Files Involved | +|-----------------------------|---------------------------------------------------------------------------------------| +| Create user (`useradd`) | `/etc/login.defs`, `/etc/default/useradd`, `/etc/passwd`, `/etc/shadow`, `/etc/group` | +| Change password (`passwd`) | `/etc/shadow`, PAM | +| Login attempt | `/etc/passwd`, `/etc/shadow`, PAM | +| Check group membership | `/etc/group` | +| Enforce password complexity | PAM only | + +--- + +## 🏗 Simple Mental Model + +| Category | Files | +|-------------|----------------------------------| +| 📁 Identity | `/etc/passwd` | +| 🔐 Secrets | `/etc/shadow`, `/etc/gshadow` | +| 📜 Policy | `/etc/login.defs`, `/etc/pam.d/` | +| 👥 Groups | `/etc/group` | + +- prevent a user from logging in + - in shadow file, add `!` or `*` before the password hash to lock the account + - in passwd file, change the shell to `/sbin/nologin` or `/bin/false` to prevent login + - in login.defs, set `PASS_MAX_DAYS` to 0 to expire the password immediately, but this is less common for locking accounts +- syntax in passwd file: `atul:x:1000:1000:...:/bin/bash` + - Meaning of each field: + - `atul`: username + - `x`: password placeholder (actual hash is in shadow file) + - `1000`: user ID (UID) + - 1000 is the first non-system user ID on most Linux distributions, + - so this indicates a regular user account + - `1000`: group ID (GID) + - 1000 is the primary group for 
this user, which usually has the same name as the username + - `...`: user info (e.g., full name, contact info) + - `/bin/bash`: default shell + - Allowed changes to lock account: + - change shell to `/sbin/nologin` or `/bin/false` + - change password field to `!` or `*` to prevent password authentication +- syntax in shadow file: `atul:$y$...:19700:0:99999:7:::` + - Meaning of each field: + - `atul`: username (must match passwd file) + - `$y$...`: password hash (using yescrypt in Ubuntu) + - `19700`: last password change date (days since Jan 1, 1970) + - `0`: minimum password age (days before change allowed) + - `99999`: maximum password age (days before expiration) + - `7`: warning period (days before expiration to warn user) + - `:::`: reserved fields for inactivity and account expiration, usually left empty + - Allowing changes to lock account: + - change password hash to `!` or `*` to prevent password authentication +- complete syntax of login.defs file: + - `PASS_MAX_DAYS 99999`: maximum password age (days before expiration) + - `PASS_MIN_DAYS 0`: minimum password age (days before change allowed) + - `PASS_WARN_AGE 7`: warning period (days before expiration to warn user) + - `UID_MIN 1000`: minimum user ID for regular users + - `GID_MIN 1000`: minimum group ID for regular groups + - `UMASK 022`: default file permissions for new files (e.g., 755 for directories, 644 for files) +- command to update the password for a user: + - `passwd atul` (will prompt for new password and update shadow file) \ No newline at end of file diff --git a/Linux/xargs.md b/Linux/xargs.md new file mode 100644 index 0000000..dc16343 --- /dev/null +++ b/Linux/xargs.md @@ -0,0 +1,14 @@ +- xargs command + - used to build and execute command lines from standard input + - often used in combination with other commands that produce output, such as find, grep, or ls + - syntax: `xargs [options] [command]` + - example: `find . 
-name "*.txt" | xargs rm` (find all .txt files and delete them) + - options: + - `-n`: specify the maximum number of arguments to be used per command line + - `-p`: prompt the user before executing each command line + - `-0`: input items are terminated by a null character instead of whitespace (useful for handling filenames with spaces) + - `-I {}`: replace occurrences of {} in the command with the input item + - xargs is a powerful tool for processing large lists of items and can help avoid issues with command line length limits when dealing with many files or arguments. + - example commands: + - `ls | xargs -n 1 echo` (list files and echo each filename on a new line) + - `grep "pattern" file.txt | xargs -I {} echo "Found: {}"` (search for a pattern in a file and print each match with a prefix) diff --git a/LinuxCommands.md b/LinuxCommands.md index bd9b845..a1479c3 100755 --- a/LinuxCommands.md +++ b/LinuxCommands.md @@ -109,6 +109,8 @@ - which ls - /usr/bin/ls 48. watch: + - Run a command repeatedly and see live updates. + - watch -n 1 "ls -lh" 49. history -c: clears the history of commands 50. sudo apt-get install "program name"/ sudo apt install "program, name"- install applications 51. sudo remove "programName" - Uninstall a program @@ -323,6 +325,9 @@ using graphics.h library to run computer graphics programs - nvcc "file_name.cu" -o "output_file_name" 95. cat /etc/group: display all the groups associated with the system 96. pstree: display all processes in a tree format +97. tree + - Show directory structure beautifully. + - tree -L 2: limit the depth of the directory tree to 2 levels 97. ssh-copy-id username@192.168.1.23: Set up passwordless SSH access to a remote host using an available key pair 98. uptime: facts about the system, - including the length of time the system has been "up" (running) since its last re-boot, @@ -406,6 +411,8 @@ using graphics.h library to run computer graphics programs - chgrp mygroup myfile.txt 114. 
passwd: - Change a user's password + - command to update the password for a user: + - `passwd atul` (will prompt for new password and update shadow file) 115. exit: - logout and close shell 116. shutdown [options] [time] [message] @@ -749,16 +756,411 @@ using graphics.h library to run computer graphics programs - detailed information about the system's hardware components from the DMI (Desktop Management Interface) tables. 178. launch chromium browser - ```chromium-browser``` -- list all available packages that can be installed RHEL +179. list all available packages that can be installed RHEL - sudo yum list available -- list of packages installed - - Debian-based systems (e.g., Ubuntu): - - dpkg --get-selections - - Using apt: - - apt list --installed - - Red Hat-based systems (e.g., RHEL, CentOS): - - rpm -qa - - dnf/yum list installed - - Generic: - - ls /var/lib/dpkg/info for Debian - - ls /var/lib/rpm for Red Hat + - list of packages installed + - Debian-based systems (e.g., Ubuntu): + - dpkg --get-selections + - Using apt: + - apt list --installed + - Red Hat-based systems (e.g., RHEL, CentOS): + - rpm -qa + - dnf/yum list installed + - Generic: + - ls /var/lib/dpkg/info for Debian + - ls /var/lib/rpm for Red Hat +180. Clear local cache/apt repo + - Debian/Ubuntu: sudo apt-get clean + - RHEL7-/CentOS: sudo yum clean all + - RHEL8+/Fedora: sudo dnf clean all +181. get hardware information about the system: + - lscpu + - lsusb + - lspci + - lsblk + - free -h + - df -h + - inxi -Fxz + - hwinfo + - sudo dmidecode + - get hardware device profile of system + - lshw +182. update repositories and packages: + - Debian/Ubuntu: sudo apt-get update && sudo apt-get upgrade + - RHEL7-/CentOS: sudo yum update + - RHEL8+/Fedora: sudo dnf update +183. install a package: + - Debian/Ubuntu: sudo apt-get install package_name + - RHEL7-/CentOS: sudo yum install package_name + - RHEL8+/Fedora: sudo dnf install package_name +184. 
remove a package: + - Debian/Ubuntu: sudo apt-get remove package_name + - RHEL7-/CentOS: sudo yum remove package_name + - RHEL8+/Fedora: sudo dnf remove package_name +185. search for a package: + - Debian/Ubuntu: apt-cache search package_name + - RHEL7-/CentOS: yum search package_name + - RHEL8+/Fedora: dnf search package_name +186. list installed packages: + - Debian/Ubuntu: dpkg --get-selections or apt list --installed + - RHEL7-/CentOS: rpm -qa or yum list installed + - RHEL8+/Fedora: rpm -qa or dnf list installed +187. check for package updates: + - Debian/Ubuntu: apt list --upgradable + - RHEL7-/CentOS: yum check-update + - RHEL8+/Fedora: dnf check-update +188. view package details: + - Debian/Ubuntu: apt-cache show package_name + - RHEL7-/CentOS: yum info package_name + - RHEL8+/Fedora: dnf info package_name +189. View all currently set env variables: + - printenv + - env +190. Add new users to the system: + - sudo adduser username (Ubuntu) + - sudo useradd username (RHEL/CentOS) +191. Change user password: + - sudo passwd username +192. Delete a user: + - sudo deluser username (Ubuntu) + - sudo userdel username (RHEL/CentOS) +193. List all users: + - cat /etc/passwd + - cut -d: -f1 /etc/passwd +194. List all groups: + - cat /etc/group + - cut -d: -f1 /etc/group +195. Add a user to a group: + - sudo usermod -aG groupname username + - Add a new user to one or more groups (RHEL/CentOS) + - sudo useradd -G group1,group2 username + - Add a new user to one or more groups (Ubuntu) + - sudo adduser username group1 + - sudo adduser username group2 +196. Remove a user from a group: + - sudo gpasswd -d username groupname +197. Check group membership of a user: + - groups username + - id username +198. Change file permissions: + - chmod 755 filename + - chmod u=rwx,g=rx,o=rx filename +199. Change file ownership: + - sudo chown username:groupname filename +200. View file permissions and ownership: + - ls -l filename + - stat filename +201. 
Enable or disable a service: + - systemctl start service_name + - systemctl stop service_name + - systemctl enable service_name + - systemctl disable service_name + - systemctl status service_name +202. View running processes: + - ps aux + - top + - htop +203. Kill a process: + - kill PID + - kill -9 PID + - pkill process_name + - killall process_name +204. View system logs: + - journalctl + - tail -f /var/log/syslog (Debian/Ubuntu) + - tail -f /var/log/messages (RHEL/CentOS) +205. Check disk usage: + - df -h + - du -sh /path/to/directory +206. Check memory usage: + - free -h + - top + - htop +207. Check CPU usage: + - top + - htop + - mpstat +208. Check network status: + - ifconfig + - ip a + - netstat -tuln + - ss -tuln +209. Check open ports: + - netstat -tuln + - ss -tuln +210. Check firewall status: + - sudo ufw status (Debian/Ubuntu) + - sudo firewall-cmd --state (RHEL/CentOS) + - sudo systemctl status firewalld (RHEL/CentOS) +211. Enable or disable firewall: + - sudo ufw enable (Debian/Ubuntu) + - sudo ufw disable (Debian/Ubuntu) + - sudo systemctl start firewalld (RHEL/CentOS) + - sudo systemctl stop firewalld (RHEL/CentOS) +212. Check SELinux status: + - sestatus + - getenforce +213. Switch SELinux mode: + - sudo setenforce 1 (enforcing) + - sudo setenforce 0 (permissive) + - setenforce cannot disable SELinux at runtime; to make a change permanent, edit the /etc/selinux/config file and set SELINUX=enforcing, SELINUX=permissive or SELINUX=disabled, then reboot +214. Check system uptime: + - uptime + - who -b +215. Check system load: + - uptime + - top + - htop +216. Check system time and date: + - date + - timedatectl +217. Set system time and date: + - sudo timedatectl set-time "YYYY-MM-DD HH:MM:SS" + - sudo timedatectl set-timezone Timezone + - e.g. + - sudo timedatectl set-time "2023-09-27 20:00:00" + - sudo timedatectl set-timezone Asia/Kolkata + - sudo date -s "YYYY-MM-DD HH:MM:SS" +218. Check system hostname: + - hostname + - hostnamectl +219. 
Set system hostname: + - sudo hostnamectl set-hostname new_hostname + - sudo hostname new_hostname +220. Check system architecture: + - uname -m + - arch + - lscpu | grep Architecture +221. Check kernel version: + - uname -r + - cat /proc/version +222. Check Linux distribution: + - lsb_release -a + - cat /etc/os-release + - cat /etc/issue + - cat /etc/*-release +223. Check system information: + - inxi -Fxz + - lshw + - hwinfo + - sudo dmidecode +224. Check system resource usage: + - top + - htop + - vmstat + - iostat + - free -h + - df -h +225. Check system performance: + - top + - htop + - vmstat + - iostat + - sar + - mpstat +226. Check system processes: + - ps aux + - top + - htop + - pgrep process_name + - pidof process_name +227. Check system services: + - systemctl list-units --type=service + - systemctl status service_name + - service --status-all +228. Check system users: + - cat /etc/passwd + - cut -d: -f1 /etc/passwd + - getent passwd +229. Check system groups: + - cat /etc/group + - cut -d: -f1 /etc/group + - getent group +230. Check system environment variables: + - printenv + - env + - set +231. Load a Kernel Module: + - sudo modprobe module_name +232. List Loaded Kernel Modules: + - lsmod +233. Remove a Kernel Module: + - sudo modprobe -r module_name +234. Check Kernel Messages: + - dmesg +235. Check System Boot Time: + - who -b +236. Check System Reboot History: + - last reboot +237. Check System Shutdown History: + - last shutdown +238. Check System Login History: + - last +239. Check System Failed Login Attempts: + - lastb +240. Check System Uptime: + - uptime +241. Encrypt a file using GPG: + - gpg -c filename +242. Decrypt a file using GPG: + - gpg filename.gpg +243. Generate a GPG key pair: + - gpg --full-generate-key +244. List GPG keys: + - gpg --list-keys +245. Export a GPG public key: + - gpg --export -a "User Name" > public_key.asc +246. Import a GPG public key: + - gpg --import public_key.asc +247. 
Sign a file using GPG: + - gpg --sign filename +248. Verify a GPG signature: + - gpg --verify filename.sig +249. Create a GPG detached signature: + - gpg --detach-sign filename +250. Verify a GPG detached signature: + - gpg --verify filename.sig filename +251. Remove a package together with its configuration files: + - Debian/Ubuntu: sudo apt-get purge package_name + - RHEL/CentOS/Fedora: rpm-based tools have no exact purge equivalent; sudo yum/dnf remove package_name removes the package (modified config files are saved as .rpmsave) and sudo dnf autoremove cleans up unused dependencies +252. Dump socket statistics(ss): + - ss -s +253. Prevent interactive shell login for a user: + - sudo usermod -s /usr/sbin/nologin username + - sudo usermod -s /bin/false username + - Note: sudo systemctl isolate multi-user.target switches the whole system to the non-graphical multi-user target (the old runlevel 3); + - it affects all users, not a single account. +254. tac command + - reverse of cat command + - ```tac filename``` + - print file in reverse order, starting from the last line and ending with the first line. +255. rev command + - reverse the characters in each line of a file + - ```rev filename``` + - ```echo "Hello World" | rev``` + - output: "dlroW olleH" +256. nl command + - number lines of a file + - ```nl filename``` + - ```echo -e "Line 1\nLine 2\nLine 3" | nl``` +257. yes command + - output a string repeatedly until killed + - ```yes "Hello World"``` + - output: "Hello World" repeated indefinitely + - ```yes | head -n 5``` + - output: "y" repeated 5 times +258. factor command + - factor a number into its prime factors + - ```factor 60``` + - output: "60: 2 2 3 5" + - ```factor 100``` + - output: "100: 2 2 5 5" +259. seq command + - print a sequence of numbers, one per line + - ```seq 1 10``` + - output: "1 2 3 4 5 6 7 8 9 10" + - ```seq 0 2 10``` + - output: "0 2 4 6 8 10" +260. Repeat last command. 
+ - ```!!``` + - ```!n``` (where n is the command number in history) +261. Repeat last command starting with a specific string. + - ```!string``` + - e.g. ```!ls``` will repeat the last command that starts with "ls" +262. Repeat last command with a specific string replaced. + - ```^old^new``` + - e.g. ```^foo^bar``` will repeat the last command, replacing the first occurrence of "foo" with "bar" +263. Display the last command with a specific string replaced, without executing it. + - ```echo !!:s/old/new/``` + - e.g. ```echo !!:s/foo/bar/``` will display the last command with the first occurrence of "foo" replaced by "bar", but will not execute the command. +264. Display the last command with all occurrences of a specific string replaced, without executing it. + - ```echo !!:gs/old/new/``` + - e.g. ```echo !!:gs/foo/bar/``` will display the last command with all occurrences of "foo" replaced by "bar", but will not execute the command. +265. lsof + - list open files + - see which process is using a specific file or port + - ```lsof -i :80``` + - ```lsof -u username``` + - ```lsof -p PID``` + - ```lsof +c 0``` + - ```lsof -nP``` + - ```lsof -r 5``` + - ```lsof -iTCP -sTCP:LISTEN``` + - ```lsof -iUDP``` + - ```lsof -i :80 -sTCP:LISTEN``` + - ```lsof -i :80 -sTCP:ESTABLISHED``` + - ```lsof -i :80 -sTCP:TIME_WAIT``` +266. Find top 5 biggest files: + - du -ah . | sort -rh | head -5 +266. strace + - trace system calls and signals + - ```strace -p PID``` + - ```strace -e trace=file command``` + - ```strace -e trace=network command``` + - ```strace -e trace=process command``` + - ```strace -e trace=memory command``` + - ```strace -e trace=signal command``` + - ```strace -e trace=desc command``` + - ```strace -e trace=ipc command``` + - ```strace -e trace=all command``` + - ```strace -o output.txt command``` + - ```strace -c command``` + - ```strace -f command``` + - ```strace -ff command``` +267. 
ltrace + - trace library calls + - ```ltrace -p PID``` + - ```ltrace command``` + - ```ltrace -e function command``` + - ```ltrace -e 'function1|function2' command``` + - ```ltrace -o output.txt command``` + - ```ltrace -c command``` +268. time + - measure the execution time of a command + - ```time command``` + - ```/usr/bin/time -v command``` + - ```/usr/bin/time -f "Time: %E, CPU: %P, Memory: %M KB" command``` + - ```/usr/bin/time -o output.txt command``` +269. nice + - run a command with a modified scheduling priority + - Range: -20 to 19 + - Default: 0 + - Lower value → Higher priority (-20) + - Higher value → Lower priority (19) + - ```nice command``` + - ```nice -n 10 command``` + - ```nice -n -10 command``` + - Check Nice Value of Running Process + - ps -o pid,comm,nice -p PID + - ps -o pid,ni,cmd -p 1234 + - Change Priority of Running Process (renice) + - renice 15 -p 1234 + - Only root can set negative values (increase priority) +270. ionice + - set or get the I/O scheduling class and priority of a process + - I/O Scheduling Classes: + - Class 0: None (no class set; the kernel derives I/O priority from the nice value) + - Class 1: Real-time (highest priority) + - Class 2: Best-effort (default) + - Class 3: Idle (lowest priority) + - ```ionice -c 2 -n 0 command``` + - ```ionice -c 2 -n 7 command``` + - ```ionice -c 1 command``` +271. chrt + - manipulate real-time attributes of a process + - Scheduling Policies: + - SCHED_FIFO: First In, First Out (real-time) + - SCHED_RR: Round Robin (real-time) + - SCHED_OTHER: Default Linux time-sharing + - SCHED_BATCH: For batch processing + - SCHED_IDLE: For very low priority tasks + - ```chrt -f 99 command``` + - ```chrt -r 50 command``` + - ```chrt -o 0 command```
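The nice/renice workflow described above can be sketched end to end — a minimal example, assuming a Linux system with the procps `ps` (the `sleep 30` job stands in for any long-running command):

```shell
# Start a background job at niceness 10 (lower priority than the default 0)
nice -n 10 sleep 30 &
pid=$!

# Inspect its nice value (NI column)
ps -o pid,ni,comm -p "$pid"

# Unprivileged users may only raise the value (i.e. lower the priority) further
renice 15 -p "$pid"
ps -o ni= -p "$pid"

kill "$pid"
```

Lowering the value again (e.g. back to 0) would require root, which is why long batch jobs are usually started with `nice` up front rather than reniced later.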
\ No newline at end of file diff --git a/Machine Learning/AglomerativeHierarchialClustering.webp b/Machine Learning/AglomerativeHierarchialClustering.webp new file mode 100644 index 0000000..af7528f Binary files /dev/null and b/Machine Learning/AglomerativeHierarchialClustering.webp differ diff --git a/Machine Learning/AglomerativeHierarchialClustering.webp:Zone.Identifier b/Machine Learning/AglomerativeHierarchialClustering.webp:Zone.Identifier new file mode 100644 index 0000000..e69de29 diff --git a/Machine Learning/Fuzzy.webp b/Machine Learning/Fuzzy.webp new file mode 100644 index 0000000..049e275 Binary files /dev/null and b/Machine Learning/Fuzzy.webp differ diff --git a/Machine Learning/Fuzzy.webp:Zone.Identifier b/Machine Learning/Fuzzy.webp:Zone.Identifier new file mode 100644 index 0000000..e69de29 diff --git a/Machine Learning/GMM.webp b/Machine Learning/GMM.webp new file mode 100644 index 0000000..0a2138f Binary files /dev/null and b/Machine Learning/GMM.webp differ diff --git a/Machine Learning/GMM.webp:Zone.Identifier b/Machine Learning/GMM.webp:Zone.Identifier new file mode 100644 index 0000000..e69de29 diff --git a/Machine Learning/MLE.png b/Machine Learning/MLE.png new file mode 100644 index 0000000..c4e34e9 Binary files /dev/null and b/Machine Learning/MLE.png differ diff --git a/Machine Learning/Multinomial.png b/Machine Learning/Multinomial.png new file mode 100644 index 0000000..0f0819d Binary files /dev/null and b/Machine Learning/Multinomial.png differ diff --git a/Machine Learning/OpticCLuster.webp b/Machine Learning/OpticCLuster.webp new file mode 100644 index 0000000..e49cfec Binary files /dev/null and b/Machine Learning/OpticCLuster.webp differ diff --git a/Machine Learning/OpticCLuster.webp:Zone.Identifier b/Machine Learning/OpticCLuster.webp:Zone.Identifier new file mode 100644 index 0000000..e69de29 diff --git a/Machine Learning/readme.md b/Machine Learning/readme.md index 994062f..40e051d 100644 --- a/Machine Learning/readme.md +++ 
b/Machine Learning/readme.md @@ -1,986 +1,2728 @@ - Machine Learning(GeeksforGeeks) - - branch of Artificial Intelligence that focuses on developing models and algorithms - - Problem Statement - - Project Objectives - - Output Expectations - - Success Criteria - - Data Collection - - collection of datasets used as raw data to train model - - Data Preprocessing - - Data Cleaning - - resolve issues like: - - Missing Values - - Duplicate Data - - Outliers - - Outlier Detection Methods - - Z-Score Method - - IQR (Interquartile Range) Method - - range between the first quartile (Q1) and the third quartile (Q3) in a dataset. - - IQR = Q3(75th percentile) - Q1(25th percentile) - - - - Visualization Techniques - - Box Plots - - Scatter Plots - - Inconsistent Data - - Irrelevant Data - - Sampling Data - - Dimensionality Reduction - - Tools: - - OpenRefine - - Trifacta Wrangler - - TIBCO Clarity - - Cloudingo - - IBM InfoSphere QualityStage - - Data Preprocessing Techniques - - Normalization(Min-Max Scaling) - - Rescales features between 0 and 1. - - useful when the data does not follow a Gaussian distribution. - - sensitive to outliers, as extreme values can significantly affect the scaling. - - Good for algorithms like k-NN and neural networks. - - Standardization - - Encoding Categorical Variables - - Data Transformation - - Data Exploration: uncover insights, patterns, and relationships within the data - - Descriptive Statistics - - Data Visualization - - Correlation Analysis - - EDA: Exploratory Data Analysis - - visualize data - - understand main features, - - find patterns and - - discover how parts of data are connected. - - Types of EDA - - Univariate Analysis - - analyze single variable and - - understand its distribution and characteristics. - - e.g. histograms are used to show data distribution, - - box plots to detect outliers and understand data spread and - - bar charts for categorical data. 
- - Bivariate Analysis - - explore relationship between two variables - - identify correlations or associations. - - Key techniques - - Scatter Plots - - Correlation Coefficients - - Pearson’s Correlation for linear relationships between continuous variables - - Spearman’s Rank Correlation - - Cross-tabulations or Contingency Tables - - Heatmaps - - Line Graphs - - Covariance Analysis - - measures how two variables change together - - Multivariate Analysis - - examine interactions among three or more variables - - uncover complex relationships and patterns. - - techniques - - Pair Plots - - relationship between multiple variables at once - - 3D Scatter Plots - - Principal Component Analysis (PCA) - - linear technique that reduces the dimensionality of data by - - transforming the original features into a smaller set of uncorrelated features - - called principal components. - - Cluster Analysis - - Multivariate Regression - - Spatial Analysis - - used for geographical data by using maps - - and spatial plotting - - Time Series Analysis - - analyze data points collected or recorded at specific time intervals. - - Techniques - - Line Plots - - Seasonal Decomposition - - Autocorrelation Analysis + - branch of Artificial Intelligence that focuses on developing models and algorithms + - Problem Statement + - Project Objectives + - Output Expectations + - Success Criteria + - Data Collection + - collection of datasets used as raw data to train model + - Data Preprocessing + - Data Cleaning + - resolve issues like: + - Missing Values + - Duplicate Data + - Outliers + - Outlier Detection Methods + - Z-Score Method + - IQR (Interquartile Range) Method + - range between the first quartile (Q1) and the third quartile (Q3) in a dataset. 
+ - IQR = Q3(75th percentile) - Q1(25th percentile) + - + - Visualization Techniques + - Box Plots + - Scatter Plots + - Inconsistent Data + - Irrelevant Data + - Sampling Data + - Dimensionality Reduction + - Tools: + - OpenRefine + - Trifacta Wrangler + - TIBCO Clarity + - Cloudingo + - IBM InfoSphere QualityStage + - Data Preprocessing Techniques + - Normalization(Min-Max Scaling) + - Rescales features between 0 and 1. + - useful when the data does not follow a Gaussian distribution. + - sensitive to outliers, as extreme values can significantly affect the scaling. + - Good for algorithms like k-NN and neural networks. + - Standardization + - Encoding Categorical Variables + - Data Transformation + - Data Exploration: uncover insights, patterns, and relationships within the data + - Descriptive Statistics + - Data Visualization + - Correlation Analysis + - EDA: Exploratory Data Analysis + - visualize data + - understand main features, + - find patterns and + - discover how parts of data are connected. + - Types of EDA + - Univariate Analysis + - analyze single variable and + - understand its distribution and characteristics. + - e.g. histograms are used to show data distribution, + - box plots to detect outliers and understand data spread and + - bar charts for categorical data. + - Bivariate Analysis + - explore relationship between two variables + - identify correlations or associations. + - Key techniques + - Scatter Plots + - Correlation Coefficients + - Pearson’s Correlation for linear relationships between continuous variables + - Spearman’s Rank Correlation + - Cross-tabulations or Contingency Tables + - Heatmaps + - Line Graphs + - Covariance Analysis + - measures how two variables change together + - Multivariate Analysis + - examine interactions among three or more variables + - uncover complex relationships and patterns. 
+ - techniques + - Pair Plots + - relationship between multiple variables at once + - 3D Scatter Plots + - Principal Component Analysis (PCA) + - linear technique that reduces the dimensionality of data by + - transforming the original features into a smaller set of uncorrelated features + - called principal components. + - Cluster Analysis + - Multivariate Regression + - Spatial Analysis + - used for geographical data by using maps + - and spatial plotting + - Time Series Analysis + - analyze data points collected or recorded at specific time intervals. + - Techniques + - Line Plots + - Seasonal Decomposition + - Autocorrelation Analysis + - Moving Averages + - ARIMA Models + - ARIMA (AutoRegressive Integrated Moving Average) + - used for forecasting and understanding time series data. + - Time Series Analysis + - Concepts + - Trend + - direction in which a time series is moving + - Seasonality + - repetitive patterns or cycles that occur at regular intervals - Moving Averages - - ARIMA Models - - ARIMA (AutoRegressive Integrated Moving Average) - - used for forecasting and understanding time series data. - - Time Series Analysis - - Concepts - - Trend - - direction in which a time series is moving - - Seasonality - - repetitive patterns or cycles that occur at regular intervals - - Moving Averages - - smooth out short-term fluctuations and highlight longer-term trends - - Noise - - irregular and unpredictable components - - Differencing - - make the difference in values of a specified interval. - - Stationarity time series - - statistical properties such as mean, variance and autocorrelation remain constant over time. - - Order of Differencing - - number of times the time series data needs to be differenced to achieve stationarity. - - Autocorrelation Function (ACF) - - quantify the degree of similarity between a time series and a lagged version of itself. - - Resampling - - changing the frequency of the data observations. 
- - Cyclic Patterns - - Irregular Components - - Decomposition Methods - - Additive Decomposition - - Multiplicative Decomposition - - Types - - Continuous Time Series - - Data recorded at regular intervals with a continuous range of values - - like temperature, stock prices, Sensor Data, etc. - - Discrete Time Series - - recorded at specific time points like counts of events, categorical statuses, etc. - - Feature Engineering: selecting only relevant variables(features) - - create new feature or transform existing features - - Feature Creation - - generating new features from domain knowledge or by observing patterns in the data. - - Interaction Features - - Polynomial Features - - Feature Transformation - - Normalization & Scaling - - Encoding - - Mathematical transformations - - e.g. logarithmic transformations for skewed data - - Feature Selection: choosing only the most useful input features - - Filter Methods - - evaluate each feature independently with target variable - - Based on statistical measures like correlation. - - Common Filter Techniques - - Information Gain: - - Measures reduction in entropy when a feature is used. - - Chi-square test: - - Checks the relationship between categorical features. - - Fisher’s Score: - - Ranks features based on class separability. - - Pearson’s Correlation Coefficient: - - Measures linear relationship between two continuous variables. - - Variance Threshold: - - Removes features with very low variance. - - Mean Absolute Difference: - - Similar to variance threshold but uses absolute differences. - - Dispersion ratio: - - Ratio of arithmetic mean to geometric mean; - - higher values indicate useful features. 
- - Wrapper Methods - - greedy algorithms that train algorithm - - use different combination of features and compute relation between these subset features and target variable - - Common Wrapper Techniques - - Forward Selection - - start with no features - - add features one by one based on model performance improvement. - - Backward Elimination - - start with all features - - remove least significant features iteratively. - - Recursive Feature Elimination (RFE) - - recursively removes least important features - - Select based on model performance. - - Embedded Methods - - Feature selection integrated within model training. - - Combine the benefits of both filter and wrapper methods - - Common Embedded Techniques - - Lasso Regression (L1 Regularization) - - Keeps only features with non-zero coefficients. - - adds penalty equal to the absolute value of the magnitude of coefficients to prevent overfitting. - - can shrink some coefficients to zero, effectively performing feature selection. - - J(λ) = Loss + λ * Σ|coefficients| - - Ridge Regression (L2 Regularization) - - adds penalty equal to the square of the magnitude of coefficients. - - penalize large coefficient in linear regression equation - - does not perform feature selection but helps in reducing multicollinearity. - - J(λ) = Loss + λ * Σ(coefficients)² - - Elastic Net Regression - - combines L1 and L2 regularization - - balances between feature selection and coefficient shrinkage. - - J(λ1, λ2) = Loss + λ1 * Σ|coefficients| + λ2 * Σ(coefficients)² - - Decision Trees and Random Forests - - provide feature importance scores based on how much each feature contributes to - - reducing impurity. - - Gradient Boosting Machines (GBM) - - similar to Random Forests - - can be used to assess feature importance. - - pick features that reduce prediction error the most. - - Feature Scaling: - - standardize the range of independent variables or features of data. 
- - Scaling Techniques - - Absolute Maximum Scaling - - rescales each feature by dividing all values by the maximum absolute value of that feature - - feature values fall within the range of -1 to 1 - - highly sensitive to outliers - - Min-Max Scaling(Normalization) - - transforms by subtracting the minimum value and - - dividing by the difference between the maximum and minimum values. - - rescales the values to a specified range, typically between 0 and 1. - - preserving the original distribution shape - - but is still affected by outliers - - Vector Normalization - - scales each data sample (row) such that its vector length (Euclidean norm) is 1 - - focuses on the direction of data points rather than magnitude - - Standardization (Z-score Normalization) - - transforms the values to have a mean of 0 and a standard deviation of 1. - - centers the data around the mean and scales it based on the standard deviation. - - useful when the data follows a Gaussian distribution. - - less sensitive to outliers compared to Min-Max Scaling. - - preserves the shape of the original distribution of the data. - - subtracting the mean and scales them by dividing by the standard deviation - - normal distribution often benefits models like linear regression, - - logistic regression and neural networks by improving convergence speed and stability. - - Z = (X - μ) / σ - - X = Data - μ = Mean value of X - σ = Standard deviation of X - - Robust Scaling - - uses the median and interquartile range (IQR) instead of the mean and standard deviation - - transformation robust to outliers and skewed distributions. - - suitable when the dataset contains extreme values or noise. - - Reduces influence of outliers by centering on median - - Feature Extraction - - transforming raw data into a simplified and informative set of features or attributes. - - reduces data complexity and highlights the most relevant information - - Techniques - - Statistical Methods - - Mean: average value of a dataset. 
- - Median: middle value when data is ordered. - - Mode: most frequently occurring value. - - Standard Deviation: measure of data dispersion around the mean. - - Correlation and Covariance: measure relationships between variables. - - Regression Analysis: understand relationships between dependent and independent variables. - - Variance: measure of data spread. - - Skewness: measure of data asymmetry. - - Kurtosis: measure of data "tailedness". - - Dimensionality Reduction Techniques - - Principal Component Analysis (PCA) - - derives new uncorrelated components that capture most of the data’s variation - - reduces dimensionality while preserving important information - - Linear Discriminant Analysis (LDA) - - find the best combination of features to separate different classes - - t-Distributed Stochastic Neighbor Embedding (t-SNE) - - reduce high-dimensional data into two or three dimensions - - UMAP (Uniform Manifold Approximation and Projection) - - similar to t-SNE but faster and scalable - - preserves both local and global data structure - - Independent Component Analysis (ICA) - - separates a multivariate signal into additive, independent components - - useful in signal processing and image analysis. - - Aggregation & Combination - - Summing or averaging features - - for Text Data - - Bag of Words (BoW) - - Represents a document by counting word frequencies, ignoring word order, - - useful for basic text classification. - - Term Frequency-Inverse Document Frequency (TF-IDF) - - Adjusts word importance based on frequency - - in a specific document compared to all documents - - highlighting unique terms. - - Word Embeddings (Word2Vec, GloVe) - - Signal Processing Methods: used for analyzing time-series, audio and sensor data - - Fourier Transform - - convert signal from the time domain to the frequency domain to analyze its frequency components. 
- - Wavelet Transform - - analyzes signals that vary over time, - - offering both time and frequency information for non-stationary signals. - - Short-Time Fourier Transform (STFT) - - Image Processing Methods - - Histogram of Oriented Gradients (HOG) - - finds the distribution of intensity gradients or edge directions in an image - - used in object detection and recognition tasks. - - feature descriptor for object detection in images - - captures edge and gradient structures. - - Convolutional Neural Networks (CNN) Features - - learn hierarchical features from images through layers of convolutions, - - ideal for classification and detection tasks. - - Scale-Invariant Feature Transform (SIFT) - - detects and describes local features in images - - robust to scale, rotation, and illumination changes. - - Speeded-Up Robust Features (SURF) - - faster alternative to SIFT for detecting and describing image features. - - tools for Feature Extraction - - TensorFlow / Keras - - PyTorch - - NLTK(Natural Language Toolkit) - - FeatureTools - - Tsfresh - - OpenCV - - Scikit-Image - - step includes - - Data Cleaning - - handling missing values, - - correct errors or inconsistencies - - Data Transformation - - raw data into a format suitable for modeling including scaling, normalization and encoding. - - scaling numbers, - - creating new features - - or combining existing ones. - - Feature Engineering Techniques - - One-Hot Encoding - - convert categorical variables into binary indicators - - Binning - - transform continuous variables into discrete bins - - e.g labels = ['0-20', '21-40', '41-60', '61+'] - - Text Data Preprocessing - - removing stop-words - - filtering out commonly occurring words that - - provide no or very little semantic value to text analysis. - - e.g. 
"is", "the", "and" - - Categories of Stopwords - - Standard Stopwords - - articles("a", "the"), - - conjunctions ("and", "but") - - and prepositions ("in", "on") - - Domain-Specific Stopwords - - Contextual Stopwords - - Words with extremely high frequency - - Numerical Stopwords - - Digits, punctuation marks and single characters - - Stemming - - reducing words to their root form - - remove prefixes and suffixes. - - e.g. "running", "runner", and "ran" to "run" - - Stemming Algorithms - - Porter Stemmer - - rule-based algorithm - - removes common morphological and inflectional endings from words in English. - - suffix EED to EE - - 'agreed' → 'agree' - - suffix ED or ING to null - - e.g 'playing' → 'play', 'played' → 'play' - - suffix ATION to ATE - - e.g 'relational' → 'relate' - - suffix TIONAL to TION - - e.g 'conditional' → 'condition' - - suffix FUL to null - - e.g 'hopeful' → 'hope' - - Snowball Stemmer(Porter2) - - enhanced version of the Porter Stemmer - - supports multiple languages(multilingual stemmer) - - Example: - - 'running' → 'run' - - 'quickly' → 'quick' - - Lancaster Stemmer - - more aggressive and faster than other stemmers. - - also more destructive and may lead to excessively shortened stems - - Regexp Stemmer - - Regular Expression Stemmer - - allows users to define custom rules using regular expressions (regex). - - 'running' → 'runn' - Custom rule: r'ing$' removes the suffix ing. - - Krovetz Stemmer - - more linguistically accurate - - preserve meaning more effectively - - steps like converting plural forms to singular and removing ing from past-tense verbs. - - 'children' → 'child' - - Lemmatization - - Reduces words to their base form (lemma) ensuring a valid word. - - Considers the word's meaning and context to return the base form. - - Always produces a valid word. - - Example: "Better" → "good" - - Considers the context and part of speech. 
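The Porter-style suffix rules listed above can be sketched as a toy rule-based stemmer. This is illustrative only: the real Porter algorithm guards each rule with a word-measure condition that this sketch omits, which is why `running` comes out as `runn` here, just like the Regexp Stemmer example.

```python
# Toy rule-based stemmer illustrating Porter-style suffix stripping.
# Rule order matters: "eed" must come before "ed" (else 'agreed' -> 'agre'),
# and "ational" before "tional" (else 'relational' -> 'relation').
RULES = [
    ("eed", "ee"),       # 'agreed'      -> 'agree'
    ("ational", "ate"),  # 'relational'  -> 'relate'
    ("tional", "tion"),  # 'conditional' -> 'condition'
    ("ing", ""),         # 'playing'     -> 'play'
    ("ed", ""),          # 'played'      -> 'play'
    ("ful", ""),         # 'hopeful'     -> 'hope'
]

def stem(word: str) -> str:
    # Apply the first matching suffix rule; real stemmers check
    # extra conditions on what remains of the word.
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word
```

In practice NLTK's `PorterStemmer`, `SnowballStemmer`, or `LancasterStemmer` would be used instead of hand-written rules.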
- - Vectorizing text data - - transform human language into a format that machines can comprehend and process - - numerical representations - - enable computers to perform tasks such as - - sentiment analysis, - - machine translation and - - information retrieval - - Vectors are numerical representations of words, phrases or entire documents. - - Techniques - - One Hot Encoding - - each word is represented by a vector with a 1 at the position corresponding to - - the word’s index in the vocabulary - - all other positions set to zero. - - Bag of Words (BoW) - - converts text into a vector - - representing the frequency of words, disregarding grammar and word order. - - Ignores the order and context of words. - - Results in high-dimensional and sparse matrices. - - Term Frequency-Inverse Document Frequency (TF-IDF) - - extension of BoW that weighs the frequency of words by their importance across documents. - - reduces the weight of common words and increases the weight of rare but significant words. - - TF(t,d) = Number of times term t appears in document d / Total number of terms in document d - - IDF(t) = log_e(Total number of documents / Number of documents with term t in it) - - Count Vectorizer - - represents text by counting the occurrences of each word in a document. - - similar to BoW but focuses on raw counts rather than binary presence. - - converts a collection of text documents to a matrix of token counts - - Word Embeddings (Word2Vec, GloVe) - - dense vector representations of words in a continuous vector space - - semantically similar words are located closer to each other. - - Image Embeddings - - transforms images into numerical representations - - Convolutional Neural Networks (CNNs) - - extract high-level features from images - - represent them as dense vectors.
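A minimal sketch of the TF-IDF computation, using the textbook definitions tf = (term count in document) / (total terms in document) and idf = log_e(N / documents containing the term). The corpus is made up for illustration; production code would normally use scikit-learn's `TfidfVectorizer`, which adds smoothing.

```python
import math

# Toy corpus (illustrative data, not from the notes).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

def tf(term, doc_tokens):
    # Term frequency: occurrences of term / total terms in the document.
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, tokenized_docs):
    # Inverse document frequency: log_e(N / number of docs containing the term).
    n_containing = sum(1 for d in tokenized_docs if term in d)
    return math.log(len(tokenized_docs) / n_containing)

tokenized = [d.split() for d in docs]
# TF-IDF score of "cat" in the first document: rare across the corpus,
# so it gets a higher weight than the ubiquitous "the".
score = tf("cat", tokenized[0]) * idf("cat", tokenized)
```

Note that "the" appears in two of the three documents, so its idf (log 3/2) is much smaller than that of "cat" (log 3), which is exactly the down-weighting of common words described above.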
- - Feature Splitting - - Divide single feature into multiple sub-features, - - uncovering valuable insights - - Feature Engineering tools - - FeatureTools - - Tsfresh - - Pandas - - NumPy - - TPOT - - DataRobot - - Alteryx - - H2O.ai - - Model Selection - - Choosing appropriate algorithms based on: - - Problem Type - - Data Characteristics - - Computational Resources - - Model Training - - Process - - Iterative training - - Optimization - - Validation - - Model Types - - Supervised Learning - - Trains models on labeled data to predict or classify new, unseen data. - - Unsupervised Learning - - Finds patterns or groups in unlabeled data, like clustering or dimensionality reduction. - - Reinforcement Learning - - Learns through trial and error to maximize rewards, ideal for decision-making tasks. - - Self-Supervised Learning - - Utilizes large amounts of unlabeled data by - - generating own labels from the data itself - - Semi-Supervised Learning - - Combines a small amount of labeled data with a large amount of unlabeled data - - Model Evaluation - - Regularization Techniques: prevent overfitting by adding penalty to model complexity - - L1 Regularization (Lasso): Least Absolute Shrinkage and Selection Operator regression - - add the absolute value of magnitude of the coefficient as a penalty term to the loss function(L) - - L = Loss + λ * Σ|coefficients| - - can shrink some coefficients to zero, effectively performing feature selection. - - L2 Regularization (Ridge) - - add the squared magnitude of the coefficient as a penalty term to the loss function(L). - - handle multicollinearity by shrinking the coefficients of correlated features instead of eliminating them. - - L = Loss + λ * Σ(coefficients)² - - Elastic Net Regression - - combines L1 and L2 regularization - - add the absolute norm of the weights as well as the squared measure of the weights - - balances between feature selection and coefficient shrinkage. 
- - L = Loss + λ1 * Σ|coefficients| + λ2 * Σ(coefficients)² - - Dropout - - Confusion Matrix - - True Positive - - When the model correctly predicts the positive class. - - True Negative - - when the model correctly predicts the negative class. - - False Positive - - when the model incorrectly predicts the positive class for a negative instance. - - False Negative - - when the model incorrectly predicts the negative class for a positive instance. - - Cross-Validation - - estimates how well the model performs on unseen data while preventing overfitting - - train the model on some parts of the dataset and test it on the remaining part - - Techniques - - Holdout Validation - - a single split into training and test sets, e.g. 70/30 or 80/20 (sometimes 50/50). - - LOOCV (Leave One Out Cross Validation) - - model is trained on the entire dataset except for one data point which is used for testing. - - repeated for each data point in the dataset. - - computationally expensive for large datasets. - - K-Fold Cross-Validation - - split the dataset into k equal-sized folds. - - model is trained on k-1 folds and tested on the remaining fold. - - process is repeated k times, with each fold used as the test set once. - - final performance is averaged over all k iterations. - - k is typically between 5 and 10. - - Stratified K-Fold Cross-Validation - - ensures each fold of the cross-validation process has the same class distribution as the full dataset. - - important for imbalanced datasets - - Leave-One-Out Cross-Validation (LOOCV) - - Shuffle Split Cross-Validation - - Performance Metrics - - common metrics include: - - Accuracy - - ratio of correctly predicted instances to the total instances in a dataset. - - can be misleading when one class is more dominant over the other - - Accuracy = (TP + TN) / (TP + TN + FP + FN) - - Precision - - ratio of true positive predictions to the total predicted positives. - - measures the accuracy of positive predictions.
- - important in scenarios where false positives are costly. - - like spam detection, fraud detection etc. - - Precision = TP / (TP + FP) - - Recall - - Recall (Sensitivity) is the ratio of true positive predictions to the total actual positives. - - measures how good the model is at finding positives. - - important in scenarios where false negatives are costly. - - like disease diagnosis, safety-critical systems etc. - - Recall = TP / (TP + FN) - - F1-Score - - F1-Score is the harmonic mean of Precision and Recall, - - providing a balance between the two metrics. - - useful when dealing with imbalanced datasets - - where one class is significantly more prevalent than the other. - - F1 = 2 * (Precision * Recall) / (Precision + Recall) - - Specificity (True Negative Rate) - - ratio of true negative predictions to the total actual negatives. - - measures how well the model identifies negative instances. - - important in scenarios where false positives are costly. - - like spam detection, fraud detection etc. - - Specificity = TN / (TN + FP) - - 1 - Specificity = False Positive Rate (FPR) - - Type I Error (False Positive Rate) - - occurs when the model incorrectly predicts the positive class for a negative instance. - - Type I Error Rate = FP / (FP + TN) - - Type II Error (False Negative Rate) - - occurs when the model incorrectly predicts the negative class for a positive instance. - - Type II Error Rate = FN / (FN + TP) - - Mean Absolute Error (MAE) - - average of the absolute differences between predicted and actual values. - - provides a straightforward measure of prediction accuracy. - - MAE = (1/n) * Σ|predicted - actual| - - Mean Squared Error (MSE) - - average of the squared differences between predicted and actual values. - - penalizes larger errors more than MAE. - - MSE = (1/n) * Σ(predicted - actual)² - - Root Mean Squared Error (RMSE) - - square root of the average of the squared differences between predicted and actual values.
- - provides an interpretable measure of prediction accuracy in the same units as the target variable. - - RMSE = sqrt((1/n) * Σ(predicted - actual)²) - - R-squared (Coefficient of Determination) - - indicates the proportion of variance in the dependent variable that can be explained by the independent variables. - - R² = 1 - (SS_res / SS_tot) - - SS_res = Σ(actual - predicted)² - - SS_tot = Σ(actual - mean(actual))² - - ROC-AUC Curve - - ROC-AUC (Receiver Operating Characteristic - Area Under the Curve) - - Check how well a binary classification model works - - how well the model separates the positive cases like people with a disease from the - - negative cases like people without the disease at different threshold level. - - measures a model's ability to distinguish between classes, - - with higher values indicating better performance. - - True Positive Rate (TPR) vs False Positive Rate (FPR) - - True Positive Rate (TPR) = TP / (TP + FN) = Recall - - False Positive Rate (FPR) = FP / (FP + TN) = 1 - Specificity - - Hyperparameter Model Tuning: optimize model performance by finding the best hyperparameters - - Grid Search CV - - brute-force approach - - train model using all possible combinations of specified hyperparameter values to - - find the best-performing setup. 
- - slow and uses a lot of computer power - - exhaustively searches through a specified subset of hyperparameters - - evaluates model performance for each combination - - computationally expensive for large hyperparameter spaces - - example: - - Tuning Logistic Regression with GridSearchCV - - Random Search CV - - randomly samples hyperparameter combinations from a given range - - evaluates model performance for each combination - - more efficient than Grid Search for large hyperparameter spaces - - Example: - - Tuning Random Forest with RandomizedSearchCV - - Tuning Decision Tree with RandomizedSearchCV - - Bayesian Optimization - - treats hyperparameter tuning like a mathematical optimization problem and - - learns from past results to decide what to try next. - - builds a probabilistic model(surrogate function) of the objective function - - uses it to select the most promising hyperparameter combinations to evaluate - - balances exploration and exploitation - - P(score(y)∣hyperparameters(x)) - - surrogate function models the relationship between hyperparameters x and the score y. - - surrogate models used in Bayesian optimization include: - - Gaussian Processes - - Random Forests - - Tree-structured Parzen Estimators (TPE) - - Genetic Algorithms - - inspired by the process of natural selection - - iteratively evolves a population of hyperparameter combinations - - selects the best-performing combinations for the next generation - - Model Deployment - - Containerization - - Docker - - Kubernetes - - Cloud Services - - AWS SageMaker - - Google AI Platform - - Microsoft Azure ML - - Model Monitoring and Maintenance - - Model Drift Detection - - Performance Monitoring - - Monitoring issues like: - - Data Drift - - Data Drift occurs when the - - statistical properties of the input data change over time, - - leading to a decline in model performance. 
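The difference between grid search and random search above can be sketched without any ML library. `cv_score` below is a made-up analytic stand-in for a cross-validated model score (its name and shape are assumptions for illustration); in practice this role is played by scikit-learn's `GridSearchCV` / `RandomizedSearchCV` training and validating a real model.

```python
import itertools
import random

# Hypothetical objective: pretend cross-validated score as a function of
# hyperparameters, peaking at learning_rate=0.1, max_depth=4.
def cv_score(learning_rate, max_depth):
    return -((learning_rate - 0.1) ** 2) - 0.01 * (max_depth - 4) ** 2

grid = {"learning_rate": [0.01, 0.1, 1.0], "max_depth": [2, 4, 8]}

# Grid search: exhaustively evaluate every combination (3 x 3 = 9 fits).
best = max(
    (dict(zip(grid, combo)) for combo in itertools.product(*grid.values())),
    key=lambda params: cv_score(**params),
)

# Random search: evaluate only a fixed budget of sampled combinations.
rng = random.Random(0)
sampled = [{k: rng.choice(v) for k, v in grid.items()} for _ in range(5)]
best_random = max(sampled, key=lambda params: cv_score(**params))
```

Grid search pays for its exhaustiveness with exponential cost in the number of hyperparameters, which is why random search is often preferred for large spaces.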
- - Concept Drift - - Concept Drift refers to changes in the - - underlying relationships between input features and target variables over time, - - which can result in decreased model accuracy. - - Algorithms - - Regression - - Linear Regression - - assumes that there is a linear relationship between the input and output - - relationship represented by a straight line. - - uses the equation of a line: Y = mX + b - - Y = predicted value (output) - - m = slope of the line (coefficient) - - X = input feature - - b = y-intercept (constant term) - - Simple Linear Regression - - involves a single independent variable to predict a dependent variable. - - equation: Y = β0 + β1X1 - - Y = predicted value - - β0 = intercept - - β1 = coefficient for the feature - - X1 = input feature - - Multiple Linear Regression - - involves two or more independent variables to predict a dependent variable. - - equation: Y = β0 + β1X1 + β2X2 + ... + βnXn - - Y = predicted value - - β0 = intercept - - β1, β2, ..., βn = coefficients for each feature - - X1, X2, ..., Xn = input features - - Multicollinearity - - occurs when two or more independent variables in a regression model are highly correlated. - - can lead to unstable coefficient estimates and make it difficult to assess the individual effect of each predictor. - - Detection Methods - - Variance Inflation Factor (VIF) - - quantifies how much the variance of a regression coefficient is increased due to multicollinearity. - - VIF = 1 / (1 - R²) - - R² = coefficient of determination from regressing the predictor against all other predictors. - - A VIF value greater than 5 or 10 indicates high multicollinearity. - - Correlation Matrix - - examines pairwise correlations between independent variables. - - high correlation coefficients (e.g., above 0.8 or below -0.8) suggest multicollinearity. - - Condition Index - - assesses the sensitivity of the regression estimates to small changes in the data. 
- - values above 30 indicate potential multicollinearity issues. - - Remedies - - Remove one of the correlated variables - - Combine correlated variables into a single predictor using techniques like Principal Component Analysis (PCA) - - Regularization methods like Ridge Regression or Lasso Regression - - Polynomial Regression - - models the relationship between the independent variable and dependent variable as an nth degree polynomial. - - captures non-linear relationships by introducing polynomial terms of the independent variable. - - use cases - - House price forecasting based on features like size, location, and number of bedrooms. - - Sales forecasting based on advertising spend and market trends. - - Estimating the impact of temperature on electricity consumption. - - Stock price prediction based on historical data and market indicators. - - Medical risk prediction based on patient health metrics. - - Minimizing the error - - Ordinary Least Squares (OLS) - - minimizes the sum of squared differences between observed and predicted values. - - Residuals = Observed - Predicted - - Cost Function (SSE) = Σ(Residuals)² - - Hypothesis in Linear Regression - - represents the predicted output based on input features and model parameters. - - Simple Linear Regression Hypothesis: - - h(X) = β0 + β1X1 - - h(X) = predicted value - - β0 = intercept - - β1 = coefficient for the feature - - X1 = input feature - - Multiple Linear Regression Hypothesis: - - h(X) = β0 + β1X1 + β2X2 + ...
+ βnXn - - h(X) = predicted value - - β0 = intercept - - β1, β2, ..., βn = coefficients for each feature - - X1, X2, ..., Xn = input features - - Cost Function - - Mean Squared Error (MSE) - - MSE = (1/n) * Σ(predicted - actual)² - - Root Mean Squared Error (RMSE) - - RMSE = sqrt((1/n) * Σ(predicted - actual)²) - - Mean Absolute Error (MAE) - - MAE = (1/n) * Σ|predicted - actual| - - Gradient Descent - - find best fit line for the data - - iterative optimization algorithm - - minimizes the cost function by updating model parameters in the direction of the steepest descent. - - minimize the prediction error - - start with random model parameters and - - repeatedly adjust them to reduce the difference between predicted and actual values. - - cost function = MSE = (1/n) * Σ(predicted - actual)² - - Gradient computation: - - for MSE cost function, the gradient with respect to βj is: - - (∂/∂βj) * MSE = (2/n) * Σ(predicted - actual) * Xj - - Xj = input feature corresponding to βj - - update rule: - - βj = βj - α * (∂/∂βj) * Cost Function - - βj = model parameter (coefficient) - - α = learning rate (step size) - - (∂/∂βj) * Cost Function = gradient of the cost function with respect to βj - - Gradient Descent Variants for Linear Regression - - Batch Gradient Descent - - Stochastic Gradient Descent (SGD) - - Mini-Batch Gradient Descent - - Evaluation Metrics for Linear Regression - - R-squared (Coefficient of Determination) - - indicates how much variation the developed model can explain or capture - - R² = 1 - (SS_res / SS_tot) - - SS_res = Σ(actual - predicted)² - - SS_tot = Σ(actual - mean(actual))² - - Residual Standard Error (RSE) - - measures the average amount that the observed values deviate from the predicted values. 
- - RSE = sqrt(SS_res / (n - p - 1)) - - SS_res = Σ(actual - predicted)² - - n = number of observations - - p = number of predictors - - Residual sum of squares (RSS) - - RSS = Σ(actual - predicted)² - - Total sum of squares (TSS) - - TSS = Σ(actual - mean(actual))² - - Adjusted R-squared Error - - Adjusted R² = 1 - [(1 - R²)(n - 1) / (n - p - 1)] - - n = number of observations - - p = number of predictors - - Mean Absolute Error (MAE) - - MAE = (1/n) * Σ|predicted - actual| - - Mean Squared Error (MSE) - - average of the squared differences between the actual and predicted values for all the data points - - gives higher weight to larger errors. - - MSE = (1/n) * Σ(predicted - actual)² - - Root Mean Squared Error (RMSE) - - RMSE = sqrt((1/n) * Σ(predicted - actual)²) - - Logistic Regression - - Used when the output is a "yes or no" type answer - - helps in predicting categories like pass/fail or spam/not spam. - - predicts binary outcomes (0 or 1) based on input features. - - used for classification problems - - uses the logistic function (sigmoid function) to model the probability of the positive class. 
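The gradient-descent update rule described above, applied to simple linear regression on toy data. A pure-Python sketch: the learning rate, iteration count, and data are illustrative choices, not prescriptions.

```python
# Fit y = b0 + b1*x by gradient descent on the MSE cost function.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated from y = 1 + 2x, so the optimum is known

b0, b1 = 0.0, 0.0   # start from random/zero parameters
alpha = 0.05        # learning rate (step size)
n = len(xs)

for _ in range(5000):
    preds = [b0 + b1 * x for x in xs]
    # Gradients of MSE: (2/n) * sum((pred - actual) * x_j),
    # where x_j = 1 for the intercept term.
    g0 = (2 / n) * sum(p - y for p, y in zip(preds, ys))
    g1 = (2 / n) * sum((p - y) * x for p, y, x in zip(preds, ys, xs))
    # Update rule: beta_j = beta_j - alpha * gradient_j
    b0 -= alpha * g0
    b1 -= alpha * g1
```

Because every iteration uses all n points, this is batch gradient descent; SGD and mini-batch variants replace the full sums with one sample or a small batch per update.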
- - Logistic Function (Sigmoid Function) - - S curve - - S(t) = 1 / (1 + e^(-t)) - - S(t) = predicted probability of the positive class - - e = Euler's number (approximately 2.71828) - - t = linear combination of input features and model parameters - - Likelihood function for Logistic Regression - - L(β) = Π P(y_i | X_i; β) - - P(y_i | X_i; β) = S(X_i * β) if y_i = 1 - - P(y_i | X_i; β) = 1 - S(X_i * β) if y_i = 0 - - y_i = actual label (0 or 1) - - X_i = input features for the i-th instance - - β = model parameters (coefficients) - - Log likelihood function - - LL(β) = Σ [y_i * log(S(X_i * β)) + (1 - y_i) * log(1 - S(X_i * β))] - - Gradient of the Log Likelihood - - ∂LL(β) / ∂β = Σ (y_i - S(X_i * β)) * X_i - - Odds and Log-Odds - - Odds = P / (1 - P) - - P = probability of the positive class - - Log-Odds (Logit) = log(P / (1 - P)) - - Cost Function - - Logistic Regression uses the Log Loss (Cross-Entropy Loss) as its cost function. - - Log Loss(cost) = -[y * log(p) + (1 - y) * log(1 - p)] - - y = actual label (0 or 1) - - p = predicted probability of the positive class - - Model Training - - Maximum Likelihood Estimation (MLE) - - finds the model parameters that maximize the likelihood of observing the given data. - - Types of Logistic Regression - - Binary Logistic Regression - - Two classes (0 and 1) - - e.g. spam detection, disease diagnosis etc. - - Multinomial Logistic Regression - - more than two classes without any order - - e.g. classifying types of fruits (apple, banana, orange) - - Ordinal Logistic Regression - - more than two classes with a specific order - - e.g. rating scales (poor, average, good, excellent) - - Polynomial Regression - - Decision Trees - - supervised learning algorithm used for both classification and regression tasks - - tree-like model of decisions and their possible consequences. - - splits data into subsets based on feature values - - recursively partitions the data to create a tree structure. 
- - like a flowchart to help make decisions based on input features. - - Components - - Root Node - - branches: attribute values/outcomes of a test - - Internal Nodes: attribute tests - - Leaf Nodes: Final decisions or predictions (class labels or continuous values) - - Applications - - Customer Segmentation - - Fraud Detection - - Medical Diagnosis - - Loan Approval - - Splitting Criteria - - Gini Index/Impurity - - measures how often a randomly chosen element would be incorrectly labeled - - if it was randomly labeled according to the distribution of labels in the subset. - - i.e. attribute with a lower Gini index should be preferred - - Gini = 1 - Σ(p_i)² - - p_i = proportion of instances belonging to class i - - Information Gain - - it tells how useful a question (or feature) is for splitting data into groups. - - measures how much the uncertainty decreases after the split. - - Information Gain = Entropy(Parent) - weighted average of Entropy(Children) - - Mean Squared Error (MSE) for regression tasks - - Gain(G, A) = Entropy(G) - Σ (|G_v| / |G|) * Entropy(G_v) - - G = dataset before the split - - A = feature used for splitting - - G_v = subset of G where feature A has value v - - Decision Tree Algorithms - - ID3 (Iterative Dichotomiser 3) - - greedily choosing the feature that maximizes the information gain at each node - - uses entropy as the splitting criterion - - Entropy measures impurity in the dataset - - Entropy(S) = - Σ p_i * log2(p_i) - - p_i = proportion of instances belonging to class i - - Information Gain - - IG(S, A) = Entropy(S) - Σ (|S_v| / |S|) * Entropy(S_v) - - S = dataset before the split - - A = feature used for splitting - - S_v = subset of S where feature A has value v - - it recursively splits the dataset using the feature with the highest information gain - - until all examples in a node belong to the same class or no features remain to split.
- - prone to overfitting - - C4.5 - - modified version of information gain called the gain ratio - - to reduce the bias towards features with many values - - Gain Ratio = Information Gain / Split Information - - Split Information = - Σ (|S_v| / |S|) * log2(|S_v| / |S|) - - handles both continuous and categorical features - - manages missing values effectively - - prunes the tree after creation to reduce overfitting - - struggles with large, high-dimensional, and noisy datasets. - - CART (Classification and Regression Trees) - - for both classification and regression tasks - - uses Gini impurity as the splitting criterion for classification tasks - - which measures the impurity (likelihood of incorrect classification) of a dataset. - - Gini Impurity = 1 - Σ(p_i)² - - p_i = proportion of instances belonging to class i - - for regression tasks, it uses Mean Squared Error (MSE) to minimize the variance within each node. - - produces binary trees, where each internal node has exactly two children. - - employs cost-complexity pruning after tree construction to avoid overfitting - - handles both numerical and categorical data effectively. - - CHAID (Chi-squared Automatic Interaction Detector) - - uses chi-square tests to determine the best splits especially for categorical variables. - - chi-square tests - - test the association between two categorical variables. - - used to determine whether observed frequencies differ significantly from expected frequencies - - under given hypothesis. + - smooth out short-term fluctuations and highlight longer-term trends + - Noise + - irregular and unpredictable components + - Differencing + - compute the difference between each value and the value a specified interval (lag) earlier. + - Stationary time series + - statistical properties such as mean, variance and autocorrelation remain constant over time.
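The differencing operation just described can be sketched in a few lines. A pure-Python illustration on a made-up linear trend; pandas users would typically call `Series.diff()`.

```python
# First-order differencing: each value minus the value `lag` steps earlier,
# commonly used to remove trend and move a series toward stationarity.
def difference(series, lag=1):
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

trend = [2 * t + 1 for t in range(6)]  # 1, 3, 5, 7, 9, 11 — pure linear trend
diffed = difference(trend)             # constant after a single differencing
```

A linear trend becomes stationary after one round of differencing (order of differencing = 1); a quadratic trend would need two rounds.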
+ - Order of Differencing + - number of times the time series data needs to be differenced to achieve stationarity. + - Autocorrelation Function (ACF) + - quantify the degree of similarity between a time series and a lagged version of itself. + - Resampling + - changing the frequency of the data observations. + - Cyclic Patterns + - Irregular Components + - Decomposition Methods + - Additive Decomposition + - Multiplicative Decomposition + - Types + - Continuous Time Series + - Data recorded at regular intervals with a continuous range of values + - like temperature, stock prices, Sensor Data, etc. + - Discrete Time Series + - recorded at specific time points like counts of events, categorical statuses, etc. + - Feature Engineering: selecting only relevant variables(features) + - create new feature or transform existing features + - Feature Creation + - generating new features from domain knowledge or by observing patterns in the data. + - Interaction Features + - Polynomial Features + - Feature Transformation + - Normalization & Scaling + - Encoding + - Mathematical transformations + - e.g. logarithmic transformations for skewed data + - Feature Selection: choosing only the most useful input features + - Filter Methods + - evaluate each feature independently with target variable + - Based on statistical measures like correlation. + - Common Filter Techniques + - Information Gain: + - Measures reduction in entropy when a feature is used. + - Chi-square test: + - Checks the relationship between categorical features. + - Fisher’s Score: + - Ranks features based on class separability. + - Pearson’s Correlation Coefficient: + - Measures linear relationship between two continuous variables. + - Variance Threshold: + - Removes features with very low variance. + - Mean Absolute Difference: + - Similar to variance threshold but uses absolute differences. + - Dispersion ratio: + - Ratio of arithmetic mean to geometric mean; + - higher values indicate useful features. 
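A filter-method sketch combining two of the techniques above, a variance threshold and Pearson's correlation coefficient, on toy data. Pure Python for illustration; scikit-learn's `VarianceThreshold` and the `f_regression` / `chi2` scorers are the usual tools.

```python
# Filter methods score each feature independently against the target.
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def pearson(xs, ys):
    # Pearson correlation: covariance divided by the product of std devs.
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

features = {
    "useful":   [1, 2, 3, 4, 5],   # perfectly linear in the target
    "noisy":    [5, 1, 4, 2, 3],   # weakly related
    "constant": [7, 7, 7, 7, 7],   # zero variance -> dropped by threshold
}
target = [2, 4, 6, 8, 10]

# Step 1: variance threshold removes near-constant features.
kept = {name: col for name, col in features.items() if variance(col) > 0}
# Step 2: rank survivors by absolute correlation with the target.
ranked = sorted(kept, key=lambda n: abs(pearson(kept[n], target)), reverse=True)
```

The zero-variance feature must be dropped before correlation is computed, since its standard deviation of zero would make the coefficient undefined.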
+ - Wrapper Methods + - greedy algorithms that repeatedly train a model + - on different combinations of features and evaluate how well each subset predicts the target variable + - Common Wrapper Techniques + - Forward Selection + - start with no features + - add features one by one based on model performance improvement. + - Backward Elimination + - start with all features + - remove least significant features iteratively. + - Recursive Feature Elimination (RFE) + - recursively removes least important features + - selects features based on model performance. + - Embedded Methods + - Feature selection integrated within model training. + - Combine the benefits of both filter and wrapper methods + - Common Embedded Techniques + - Lasso Regression (L1 Regularization) + - Keeps only features with non-zero coefficients. + - adds penalty equal to the absolute value of the magnitude of coefficients to prevent overfitting. + - can shrink some coefficients to zero, effectively performing feature selection. + - J(λ) = Loss + λ * Σ|coefficients| + - Ridge Regression (L2 Regularization) + - adds penalty equal to the square of the magnitude of coefficients. + - penalizes large coefficients in the linear regression equation + - does not perform feature selection but helps in reducing multicollinearity. + - J(λ) = Loss + λ * Σ(coefficients)² + - Elastic Net Regression + - combines L1 and L2 regularization + - balances between feature selection and coefficient shrinkage. + - J(λ1, λ2) = Loss + λ1 * Σ|coefficients| + λ2 * Σ(coefficients)² + - Decision Trees and Random Forests + - provide feature importance scores based on how much each feature contributes to + - reducing impurity. + - Gradient Boosting Machines (GBM) + - similar to Random Forests + - can be used to assess feature importance. + - pick features that reduce prediction error the most. + - Feature Scaling: + - standardize the range of independent variables or features of data.
+ - Scaling Techniques + - Absolute Maximum Scaling + - rescales each feature by dividing all values by the maximum absolute value of that feature + - feature values fall within the range of -1 to 1 + - highly sensitive to outliers + - Min-Max Scaling (Normalization) + - transforms values by subtracting the minimum value and + - dividing by the difference between the maximum and minimum values. + - rescales the values to a specified range, typically between 0 and 1. + - preserving the original distribution shape + - but is still affected by outliers + - Vector Normalization + - scales each data sample (row) such that its vector length (Euclidean norm) is 1 + - focuses on the direction of data points rather than magnitude + - Standardization (Z-score Normalization) + - transforms the values to have a mean of 0 and a standard deviation of 1. + - centers the data around the mean and scales it based on the standard deviation. + - useful when the data follows a Gaussian distribution. + - less sensitive to outliers compared to Min-Max Scaling. + - preserves the shape of the original distribution of the data. + - subtracts the mean and divides by the standard deviation + - normal distribution often benefits models like linear regression, + - logistic regression and neural networks by improving convergence speed and stability. + - Z = (X - μ) / σ + - X = Data + μ = Mean value of X + σ = Standard deviation of X + - Robust Scaling + - uses the median and interquartile range (IQR) instead of the mean and standard deviation + - makes the transformation robust to outliers and skewed distributions. + - suitable when the dataset contains extreme values or noise. + - Reduces influence of outliers by centering on the median + - Feature Extraction + - transforming raw data into a simplified and informative set of features or attributes. + - reduces data complexity and highlights the most relevant information + - Techniques + - Statistical Methods + - Mean: average value of a dataset.
+ - Median: middle value when data is ordered. + - Mode: most frequently occurring value. + - Standard Deviation: measure of data dispersion around the mean. + - Correlation and Covariance: measure relationships between variables. + - Regression Analysis: understand relationships between dependent and independent variables. + - Variance: measure of data spread. + - Skewness: measure of data asymmetry. + - Kurtosis: measure of data "tailedness". + - Dimensionality Reduction Techniques + - Principal Component Analysis (PCA) + - projects data onto new uncorrelated components that capture most of the data’s variation + - reduces dimensionality while preserving important information + - Linear Discriminant Analysis (LDA) + - find the best combination of features to separate different classes + - t-Distributed Stochastic Neighbor Embedding (t-SNE) + - reduce high-dimensional data into two or three dimensions + - UMAP (Uniform Manifold Approximation and Projection) + - similar to t-SNE but faster and scalable + - preserves both local and global data structure + - Independent Component Analysis (ICA) + - separates a multivariate signal into additive, independent components + - useful in signal processing and image analysis. + - Aggregation & Combination + - Summing or averaging features + - For Text Data + - Bag of Words (BoW) + - Represents a document by counting word frequencies, ignoring word order, + - useful for basic text classification. + - Term Frequency-Inverse Document Frequency (TF-IDF) + - Adjusts word importance based on frequency + - in a specific document compared to all documents, + - highlighting unique terms. + - Word Embeddings (Word2Vec, GloVe) + - Signal Processing Methods: used for analyzing time-series, audio and sensor data + - Fourier Transform + - convert signal from the time domain to the frequency domain to analyze its frequency components.
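The basic statistical features listed above all have direct equivalents in Python's standard `statistics` module; a minimal sketch on a made-up sample:

```python
import statistics

values = [3, 7, 7, 2, 9, 4, 7]  # toy sample

features = {
    "mean": statistics.mean(values),          # average value
    "median": statistics.median(values),      # middle value when ordered
    "mode": statistics.mode(values),          # most frequently occurring value
    "stdev": statistics.stdev(values),        # sample dispersion around the mean
    "variance": statistics.variance(values),  # spread (square of stdev)
}
print(features)
```

Such summary statistics are often computed per column (or per sliding window, for time series) and used as extracted features for a downstream model.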
+ - Wavelet Transform + - analyzes signals that vary over time, + - offering both time and frequency information for non-stationary signals. + - Short-Time Fourier Transform (STFT) + - Image Processing Methods + - Histogram of Oriented Gradients (HOG) + - feature descriptor that finds the distribution of intensity gradients or edge directions in an image + - captures edge and gradient structures. + - used in object detection and recognition tasks. + - Convolutional Neural Networks (CNN) Features + - learn hierarchical features from images through layers of convolutions, + - ideal for classification and detection tasks. + - Scale-Invariant Feature Transform (SIFT) + - detects and describes local features in images + - robust to scale, rotation, and illumination changes. + - Speeded-Up Robust Features (SURF) + - faster alternative to SIFT for detecting and describing image features. + - Tools for Feature Extraction + - TensorFlow / Keras + - PyTorch + - NLTK (Natural Language Toolkit) + - FeatureTools + - Tsfresh + - OpenCV + - Scikit-Image + - steps include + - Data Cleaning + - handling missing values, + - correcting errors or inconsistencies + - Data Transformation + - converts raw data into a format suitable for modeling, including scaling, normalization and encoding. + - scaling numbers, + - creating new features + - or combining existing ones. + - Data Handling + - collect, organize, analyze, and present data to extract useful information and support decision-making + - Data Handling Steps + - Data Gathering + - Recording + - Organizing + - Analyzing + - Types of Data + - Quantitative Data + - Numerical data that can be measured and quantified. + - Examples: age, income, temperature, etc. + - Types + - Discrete Data + - countable values, often integers. + - e.g. number of students in a class, number of cars in a parking lot, etc.
+ - Continuous Data + - can take any value within a range, including decimals. + - e.g. height, weight, time, etc. + - Qualitative Data + - Categorical data that describes qualities or characteristics. + - Examples: gender, color, type of product, etc. + - Graphical Representation of Data + - Bar Charts/Graphs + - graphical representation of any quantity, number, or data in the form of bars + - used to compare different categories or groups of data. + - Types of Bar Charts + - Vertical Bar Chart + - the bars of grouped data in vertical bar graphs lie vertically + - Horizontal Bar Chart + - rectangular bars lie horizontally + - frequency of the data is represented on the x-axis + - categories of data are shown on the y-axis. + - Double Bar Graph + - two sets of data are represented by two bars for each category + - allows for easy comparison between the two datasets. + - used when multiple data points need to be represented + - e.g. comparing sales of two products across different regions. + - Stacked Bar Chart + - Grouped Bar Chart + - multiple sets of data items are compared, + - with a single color used to denote a specific series across all sets. + - also called the clustered bar graph. + - Subdivided Bar Chart + - color coded segments within each bar to represent subcategories or components of the data. + - Percentage Bar Chart + - represent data in the form of percentages + - Broken-Scale Bar Chart + - scale on an axis is intentionally broken to show both very large and small values clearly. + - Deviation Bar Chart + - bars are drawn on both sides of a baseline (usually zero) to show positive and negative + values. + - Bars above the baseline represent positive deviations, while bars below represent negative + deviations from a standard or average value. + - useful for visualizing differences or changes in data relative to a reference point. + - Histograms + - shows the distribution of continuous numerical data. 
+ - data is divided into intervals (bins) and the frequency of data points in each bin is represented + by the height of the bars. + - useful for understanding the shape, central tendency, and variability of the data distribution. + - Types of Histograms + - Uniform Histogram + - basic histogram with equal-width bins. + - Bimodal Histogram + - shows two distinct peaks in the data distribution. + - Symmetric Histogram + - data is evenly distributed around the mean, creating a bell-shaped curve. + - Right Skewed Histogram + - data is skewed to the right, with a longer tail on the right side of the distribution. + - Left Skewed Histogram + - data is skewed to the left, with a longer tail on the left side of the distribution. + - Frequency Histogram + - shows the frequency of data points in each bin. + - Relative Frequency Histogram + - shows the relative frequency (proportion) of data points in each bin. + - Cumulative Histogram + - shows the cumulative frequency of data points up to each bin. + - Cumulative Relative Frequency Histogram + - shows the cumulative relative frequency (proportion) of data points up to each bin. + - Pie Charts: https://www.geeksforgeeks.org/maths/pie-charts/ + - data is represented in a circular shape + - divided into sectors or slices that represent the proportion of each category in the whole + dataset. + - also known as circle graphs or pie diagrams + - useful for showing the relative sizes of parts to a whole, especially when there are a limited + number + of categories. + - Formula for Pie Charts + - Percentage = (Value of the category / Total value of all categories) * 100 + - Box Plots + - Scatter Plots + - displays individual data points on a two-dimensional coordinate system + - Each point on the plot represents the values of two variables, allowing us to observe any + patterns, trends, or relationships between them. 
+ - commonly used in data analysis to visually explore the relationship between variables and to + identify any correlations or outliers present in the data. + - Types of Scatter Plots + - Simple Scatter Plot + - plots individual data points based on two variables. + - Bubble Chart + - similar to a scatter plot but with an additional dimension represented by the size of the + bubbles. + - useful for visualizing three variables in a two-dimensional space. + - 3D Scatter Plot + - extends the scatter plot into three dimensions, allowing for the visualization of + relationships between three variables. + - provides a more comprehensive view of the data, especially when there are complex + interactions between variables. + - Applications of Scatter Plots + - Correlation Analysis: + - identify the strength and direction of relationships between variables. + - Outlier Detection: + - identify data points that deviate significantly from the overall pattern. + - Trend Analysis: + - observe trends or patterns in the data, such as clusters or linear relationships. + - Pictograms + - pictorial representation of data + - e.g. using icons to represent the number of people in a survey + - Line Graphs + - shows how different things relate over time by connecting dots with straight lines + - useful for visualizing trends and patterns in data over time. + - Parts of a Line Graph + - Title: describes what the graph is about. + - Axes: + - X-axis (horizontal) typically represents time or categories. + - Y-axis (vertical) represents the values being measured. + - Data Points: individual points that represent the values of the variables being plotted. + - Lines: connect the data points to show trends or relationships between them. + - Legend: explains what different lines or colors represent in the graph. + - Types of Line Graphs + - Simple Line Graph + - plots a single line to show the relationship between two variables. 
+ - Multiple Line Graph + - plots multiple lines on the same graph to compare different datasets or categories. + - Cumulative Line Graph + - shows the cumulative total of a variable over time, with each point representing the sum + of all previous values up to that point. + - Step Line Graph + - connects data points with horizontal and vertical lines, creating a step-like appearance, + useful for showing changes that occur at specific intervals. + - Compound Line Graph + - combines multiple line graphs into one, often with different scales or axes, to compare + different types of data in a single visualization. + - Frequency Distribution + - shows how often each value in a dataset occurs. + - Types: + - Grouped Frequency Distribution + - observations are divided between different intervals known as class intervals + - their frequencies are counted for each class interval. + - Ungrouped Frequency Distribution + - all distinct observations are mentioned and counted individually + - Relative Frequency Distribution + - distribution displays the proportion or percentage of observations in each interval or + class. + - useful for comparing different data sets or for analyzing the distribution of data within + a set. + - Relative Frequency = (Frequency of a class interval) / (Total number of observations) + - Cumulative Frequency Distribution + - distribution shows the cumulative total of frequencies up to a certain class interval. + - useful for understanding the number of observations that fall below a particular value or + threshold. + - Cumulative Frequency = Σ(Frequency of all class intervals up to and including the current + interval) + - Frequency Distribution Curve + - a graphical representation of a frequency distribution, often used to visualize the shape + of the + distribution and identify patterns such as skewness or modality. 
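The grouped, relative and cumulative frequency formulas above can be computed directly; the dataset and class intervals below are invented for illustration:

```python
# Grouped, relative and cumulative frequency distributions for a toy dataset.
data = [12, 7, 3, 25, 18, 9, 14, 21, 5, 16]
bins = [(0, 10), (10, 20), (20, 30)]  # class intervals [low, high)

# Grouped frequency: count of observations per class interval
freq = [sum(1 for x in data if lo <= x < hi) for lo, hi in bins]

# Relative frequency: frequency of an interval / total observations
relative = [f / len(data) for f in freq]

# Cumulative frequency: running total of frequencies up to each interval
cumulative = []
running = 0
for f in freq:
    running += f
    cumulative.append(running)

print(freq, relative, cumulative)
```

A histogram is then just a bar plot of `freq` against the bin edges, and the cumulative list is what a cumulative histogram displays.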
+ - Types of Frequency Distribution Curves + - Normal Distribution Curve + - bell-shaped curve that is symmetric around the mean, representing a normal distribution of data. + - Skewed Distribution Curve + - asymmetric curve that can be skewed to the left (negatively skewed) or right (positively skewed), + indicating that the data is not symmetrically distributed. + - Bimodal Distribution Curve + - curve with two distinct peaks, indicating the presence of two different groups or modes in the data. + - Multimodal Distribution Curve + - curve with more than two peaks, indicating the presence of multiple groups or modes in the data. + - Uniform Distribution Curve + - flat curve where all values have equal frequency, indicating that the data is uniformly distributed. + - Exponential Distribution Curve + - curve that shows a rapid decrease in frequency as values increase, often used to model time between events in a Poisson process. + - Log-Normal Distribution Curve + - curve that is skewed to the right, where the logarithm of the variable is normally distributed, often used to model data that cannot be negative and has a long tail on the right side. + - Formulas + - Relative Frequency = (Number of observations in a class interval) / (Total number of observations) + - Coefficient of Variation (CV) = (Standard Deviation / Mean) * 100 + - Strategies for Handling Missing Values in Data Analysis + - Removing rows with missing values + - Imputation techniques + - Mean/Median/Mode Imputation + - replace missing values with the mean, median, or mode of the respective feature. + - Forward and Backward Fill + - forward fill (propagate last valid observation forward) and + - backward fill (propagate next valid observation backward). + - Interpolation + - estimate missing values based on the values of neighboring data points.
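The mean-imputation and forward-fill strategies above can be sketched in plain Python; the series with `None` gaps is invented for illustration (libraries such as pandas provide the same operations as `fillna` and `interpolate`):

```python
raw = [4.0, None, None, 10.0, 8.0, None]  # toy series with missing values

# Mean imputation: replace gaps with the mean of the observed values
observed = [x for x in raw if x is not None]
mean = sum(observed) / len(observed)
mean_filled = [x if x is not None else mean for x in raw]

# Forward fill: propagate the last valid observation forward
ffilled, last = [], None
for x in raw:
    last = x if x is not None else last
    ffilled.append(last)

print(mean_filled)
print(ffilled)
```

Backward fill is the same loop run over the reversed list, and linear interpolation would fill each gap proportionally between its nearest observed neighbours.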
+ - K-Nearest Neighbors (KNN) Imputation + - Regression Imputation + - Multiple Imputation + - Feature Engineering Types + - Imputation + - replace missing values + - Handling Outliers + - anomalies in data + - Binning + - categorise data + - Log Transform + - normalization + - One-hot encoding + - categorical attributes + - Grouping operations + - Scaling + - e.g. weight and height may have different metrics + - Feature Engineering Techniques + - One-Hot Encoding + - convert categorical variables into binary indicators + - Binning + - transform continuous variables into discrete bins + - e.g labels = ['0-20', '21-40', '41-60', '61+'] + - Text Data Preprocessing + - removing stop-words + - filtering out commonly occurring words that + - provide no or very little semantic value to text analysis. + - e.g. "is", "the", "and" + - Categories of Stopwords + - Standard Stopwords + - articles("a", "the"), + - conjunctions ("and", "but") + - and prepositions ("in", "on") + - Domain-Specific Stopwords + - Contextual Stopwords + - Words with extremely high frequency + - Numerical Stopwords + - Digits, punctuation marks and single characters + - Stemming + - reducing words to their root form + - remove prefixes and suffixes. + - e.g. "running", "runner", and "ran" to "run" + - Stemming Algorithms + - Porter Stemmer + - rule-based algorithm + - removes common morphological and inflectional endings from words in English. + - suffix EED to EE + - 'agreed' → 'agree' + - suffix ED or ING to null + - e.g 'playing' → 'play', 'played' → 'play' + - suffix ATION to ATE + - e.g 'relational' → 'relate' + - suffix TIONAL to TION + - e.g 'conditional' → 'condition' + - suffix FUL to null + - e.g 'hopeful' → 'hope' + - Snowball Stemmer(Porter2) + - enhanced version of the Porter Stemmer + - supports multiple languages(multilingual stemmer) + - Example: + - 'running' → 'run' + - 'quickly' → 'quick' + - Lancaster Stemmer + - more aggressive and faster than other stemmers. 
+ - also more destructive and may lead to excessively shortened stems + - Regexp Stemmer + - Regular Expression Stemmer + - allows users to define custom rules using regular expressions (regex). + - 'running' → 'runn' + Custom rule: r'ing$' removes the suffix ing. + - Krovetz Stemmer + - more linguistically accurate + - preserve meaning more effectively + - steps like converting plural forms to singular and removing ing from past-tense verbs. + - 'children' → 'child' + - Lemmatization + - Reduces words to their base form (lemma) ensuring a valid word. + - Considers the word's meaning and context to return the base form. + - Always produces a valid word. + - Example: "Better" → "good" + - Considers the context and part of speech. + - Vectorizing text data + - transform human language into a format that machines can comprehend and process + - numerical representations + - enable computers to perform tasks such as + - sentiment analysis, + - machine translation and + - information retrieval + - Vectors are numerical representations of words, phrases or entire documents. + - Techniques + - One Hot Encoding + - each word is represented by a vector with a high bit corresponding to + - the word’s index in the vocabulary + - all other bits set to zero. + - Bag of Words (BoW) + - converts text into a vector + - representing the frequency of words, disregarding grammar and word order. + - Ignores the order and context of words. + - Results in high-dimensional and sparse matrices. + - Term Frequency-Inverse Document Frequency (TF-IDF) + - extension of BoW that weighs the frequency of words by their importance across documents. + - reduces the weight of common words and increases the weight of rare but significant words. 
+ - TF(t,d) = (Number of times term t appears in document d) / (Total number of terms in document d) + - IDF(t) = log_e(Total number of documents / Number of documents with term t in it) + - Count Vectorizer + - represents text by counting the occurrences of each word in a document. + - similar to BoW but focuses on raw counts rather than binary presence. + - converts a collection of text documents to a matrix of token counts + - Word Embeddings (Word2Vec, GloVe) + - dense vector representations of words in a continuous vector space + - semantically similar words are located closer to each other. + - Image Embeddings + - transforms images into numerical representations + - Convolutional Neural Networks (CNNs) + - extract high-level features from images + - represent them as dense vectors. + - Feature Splitting + - Divide a single feature into multiple sub-features, + - uncovering valuable insights + - Feature Engineering tools + - FeatureTools + - Tsfresh + - Pandas + - NumPy + - TPOT + - DataRobot + - Alteryx + - H2O.ai + - Model Selection + - Choosing appropriate algorithms based on: + - Problem Type + - Data Characteristics + - Computational Resources + - Model Training + - Process + - Iterative training + - Optimization + - Validation + - Model Types + - Supervised Learning + - Trains models on labeled data to predict or classify new, unseen data. + - Unsupervised Learning + - Finds patterns or groups in unlabeled data, like clustering or dimensionality reduction. + - Reinforcement Learning + - Learns through trial and error to maximize rewards, ideal for decision-making tasks.
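The TF and IDF definitions above (term count divided by document length, and the natural log of the document ratio) can be implemented directly; the tiny corpus below is made up for illustration:

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

def tf(term, doc):
    # Term frequency: occurrences of the term / total terms in the document
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, corpus):
    # Inverse document frequency: log_e(total docs / docs containing the term)
    n_containing = sum(1 for d in corpus if term in d.split())
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "the" appears in 2 of 3 documents, "cat" in only 1,
# so "cat" receives a higher TF-IDF weight in the first document.
print(tf_idf("cat", docs[0], docs), tf_idf("the", docs[0], docs))
```

Production code would typically use a vectorizer class (e.g. scikit-learn's `TfidfVectorizer`, which adds smoothing and normalization); this sketch keeps the raw formulas visible.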
+ - Self-Supervised Learning + - Utilizes large amounts of unlabeled data by + - generating own labels from the data itself + - Semi-Supervised Learning + - Combines a small amount of labeled data with a large amount of unlabeled data + - Model Evaluation + - Regularization Techniques: prevent overfitting by adding penalty to model complexity + - L1 Regularization (Lasso): Least Absolute Shrinkage and Selection Operator regression + - add the absolute value of magnitude of the coefficient as a penalty term to the loss function(L) + - L = Loss + λ * Σ|coefficients| + - can shrink some coefficients to zero, effectively performing feature selection. + - L2 Regularization (Ridge) + - add the squared magnitude of the coefficient as a penalty term to the loss function(L). + - handle multicollinearity by shrinking the coefficients of correlated features instead of eliminating + them. + - L = Loss + λ * Σ(coefficients)² + - Elastic Net Regression + - combines L1 and L2 regularization + - add the absolute norm of the weights as well as the squared measure of the weights + - balances between feature selection and coefficient shrinkage. + - L = Loss + λ1 * Σ|coefficients| + λ2 * Σ(coefficients)² + - Dropout + - Confusion Matrix + - True Positive + - When the model correctly predicts the positive class. + - True Negative + - when the model correctly predicts the negative class. + - False Positive + - when the model incorrectly predicts the positive class for a negative instance. + - False Negative + - when the model incorrectly predicts the negative class for a positive instance. + - Cross-Validation + - how well model performs on unseen data while preventing overfitting + - Train model on some parts and test it on the remaining part of dataset + - Techniques + - Holdout Validation + - 50% data for training and 50% for testing. + - LOOCV (Leave One Out Cross Validation) + - model is trained on the entire dataset except for one data point which is used for testing. 
+ - repeated for each data point in the dataset. + - computationally expensive for large datasets. + - K-Fold Cross-Validation + - split the dataset into k equal-sized folds. + - train the model on k-1 folds and test it on the remaining fold. + - process is repeated k times, with each fold used as the test set once. + - final performance is averaged over all k iterations. + - k is typically chosen between 5 and 10. + - Stratified K-Fold Cross-Validation + - ensures each fold of the cross-validation process has the same class distribution as the full dataset. + - important for imbalanced datasets + - Shuffle Split Cross-Validation + - Performance Metrics + - Metrics include: + - Accuracy + - ratio of correctly predicted instances to the total instances in a dataset. + - can be misleading when one class is more dominant over the other + - Accuracy = (TP + TN) / (TP + TN + FP + FN) + - Precision + - ratio of true positive predictions to the total predicted positives. + - measures the accuracy of positive predictions. + - important in scenarios where false positives are costly. + - like spam detection, fraud detection etc. + - Precision = TP / (TP + FP) + - Recall + - Recall (Sensitivity) is the ratio of true positive predictions to the total actual positives. + - measures how good the model is at predicting positives. + - important in scenarios where false negatives are costly. + - like disease diagnosis, safety-critical systems etc. + - Recall = TP / (TP + FN) + - F1-Score + - F1-Score is the harmonic mean of Precision and Recall, + - providing a balance between the two metrics. + - useful when dealing with imbalanced datasets + - where one class is significantly more prevalent than the other. + - F1 = 2 * (Precision * Recall) / (Precision + Recall) + - Specificity (True Negative Rate) + - ratio of true negative predictions to the total actual negatives.
+ - measures how well the model identifies negative instances. + - important in scenarios where false positives are costly. + - like spam detection, fraud detection etc. + - Specificity = TN / (TN + FP) + - 1 - Specificity = False Positive Rate (FPR) + - Type I Error (False Positive Rate) + - occurs when the model incorrectly predicts the positive class for a negative instance. + - Type I Error Rate = FP / (FP + TN) + - Type II Error (False Negative Rate) + - occurs when the model incorrectly predicts the negative class for a positive instance. + - Type II Error Rate = FN / (FN + TP) + - Mean Absolute Error (MAE) + - average of the absolute differences between predicted and actual values. + - provides a straightforward measure of prediction accuracy. + - MAE = (1/n) * Σ|predicted - actual| + - Mean Squared Error (MSE) + - average of the squared differences between predicted and actual values. + - penalizes larger errors more than MAE. + - MSE = (1/n) * Σ(predicted - actual)² + - Root Mean Squared Error (RMSE) + - square root of the average of the squared differences between predicted and actual values. + - provides an interpretable measure of prediction accuracy in the same units as the target variable. + - RMSE = sqrt((1/n) * Σ(predicted - actual)²) + - R-squared (Coefficient of Determination) + - indicates the proportion of variance in the dependent variable that can be explained by the + independent variables. + - R² = 1 - (SS_res / SS_tot) + - SS_res = Σ(actual - predicted)² + - SS_tot = Σ(actual - mean(actual))² + - ROC-AUC Curve + - ROC-AUC (Receiver Operating Characteristic - Area Under the Curve) + - Check how well a binary classification model works + - how well the model separates the positive cases like people with a disease from the + - negative cases like people without the disease at different threshold level. + - measures a model's ability to distinguish between classes, + - with higher values indicating better performance. 
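The confusion-matrix metrics defined above, computed end-to-end for a small set of hypothetical labels and predictions:

```python
# Hypothetical binary labels and model predictions
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Confusion matrix cells
tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)                       # true positive rate
f1          = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)                       # true negative rate
fpr         = 1 - specificity                      # false positive rate

print(accuracy, precision, recall, f1, specificity, fpr)
```

Sweeping a probability threshold and plotting `recall` (TPR) against `fpr` at each threshold is exactly what produces the ROC curve, and the area under it is the ROC-AUC.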
+ - True Positive Rate (TPR) vs False Positive Rate (FPR) + - True Positive Rate (TPR) = TP / (TP + FN) = Recall + - False Positive Rate (FPR) = FP / (FP + TN) = 1 - Specificity + - Hyperparameter Model Tuning: + - optimize model performance by finding the best hyperparameters + - Grid Search CV + - brute-force approach + - train model using all possible combinations of specified hyperparameter values to + - find the best-performing setup. + - slow and uses a lot of computer power + - exhaustively searches through a specified subset of hyperparameters + - evaluates model performance for each combination + - computationally expensive for large hyperparameter spaces + - example: + - Tuning Logistic Regression with GridSearchCV + - Random Search CV + - randomly samples hyperparameter combinations from a given range + - evaluates model performance for each combination + - more efficient than Grid Search for large hyperparameter spaces + - Example: + - Tuning Random Forest with RandomizedSearchCV + - Tuning Decision Tree with RandomizedSearchCV + - Bayesian Optimization + - treats hyperparameter tuning like a mathematical optimization problem and + - learns from past results to decide what to try next. + - builds a probabilistic model(surrogate function) of the objective function + - uses it to select the most promising hyperparameter combinations to evaluate + - balances exploration and exploitation + - P(score(y)∣hyperparameters(x)) + - surrogate function models the relationship between hyperparameters x and the score y. 
+ - surrogate models used in Bayesian optimization include: + - Gaussian Processes + - Random Forests + - Tree-structured Parzen Estimators (TPE) + - Genetic Algorithms + - inspired by the process of natural selection + - iteratively evolves a population of hyperparameter combinations + - selects the best-performing combinations for the next generation + - Model Deployment + - Containerization + - Docker + - Kubernetes + - Cloud Services + - AWS SageMaker + - Google AI Platform + - Microsoft Azure ML + - Model Monitoring and Maintenance + - Model Drift Detection + - Performance Monitoring + - Monitoring issues like: + - Data Drift + - Data Drift occurs when the + - statistical properties of the input data change over time, + - leading to a decline in model performance. + - Concept Drift + - Concept Drift refers to changes in the + - underlying relationships between input features and target variables over time, + - which can result in decreased model accuracy. + - Algorithms + - Regression + - Linear Regression + - assumes that there is a linear relationship between the input and output + - relationship represented by a straight line. + - uses the equation of a line: Y = mX + b + - Y = predicted value (output) + - m = slope of the line (coefficient) + - X = input feature + - b = y-intercept (constant term) + - Simple Linear Regression + - involves a single independent variable to predict a dependent variable. + - equation: Y = β0 + β1X1 + - Y = predicted value + - β0 = intercept + - β1 = coefficient for the feature + - X1 = input feature + - Multiple Linear Regression + - involves two or more independent variables to predict a dependent variable. + - equation: Y = β0 + β1X1 + β2X2 + ... + βnXn + - Y = predicted value + - β0 = intercept + - β1, β2, ..., βn = coefficients for each feature + - X1, X2, ..., Xn = input features + - Multicollinearity + - occurs when two or more independent variables in a regression model are highly correlated. 
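Grid search as described above is just an exhaustive loop over every hyperparameter combination; here is a library-free sketch in which the search space and the scoring function are both made up (in practice the score would come from cross-validated model training, e.g. via scikit-learn's `GridSearchCV`):

```python
from itertools import product

# Hypothetical hyperparameter search space
param_grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}

def evaluate(params):
    # Stand-in for "train the model and return a validation score";
    # this toy objective peaks at learning_rate=0.1, depth=4.
    return -abs(params["learning_rate"] - 0.1) - abs(params["depth"] - 4) / 10

best_score, best_params = float("-inf"), None
keys = list(param_grid)
for combo in product(*param_grid.values()):  # every combination (3 * 3 = 9 here)
    params = dict(zip(keys, combo))
    score = evaluate(params)
    if score > best_score:
        best_score, best_params = score, params

print(best_params)
```

Random search replaces the exhaustive `product` loop with a fixed number of randomly sampled combinations, which is why it scales better when the grid is large.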
+ - can lead to unstable coefficient estimates and make it difficult to assess the individual effect of each predictor. + - Detection Methods + - Variance Inflation Factor (VIF) + - quantifies how much the variance of a regression coefficient is increased due to multicollinearity. + - VIF = 1 / (1 - R²) + - R² = coefficient of determination from regressing the predictor against all other predictors. + - A VIF value greater than 5 or 10 indicates high multicollinearity. + - Correlation Matrix + - examines pairwise correlations between independent variables. + - high correlation coefficients (e.g., above 0.8 or below -0.8) suggest multicollinearity. + - Condition Index + - assesses the sensitivity of the regression estimates to small changes in the data. + - values above 30 indicate potential multicollinearity issues. + - Remedies + - Remove one of the correlated variables + - Combine correlated variables into a single predictor using techniques like Principal Component Analysis (PCA) + - Regularization methods like Ridge Regression or Lasso Regression + - Polynomial Regression + - models the relationship between the independent variable and dependent variable as an nth degree polynomial. + - captures non-linear relationships by introducing polynomial terms of the independent variable. + - use cases + - House price forecasting based on features like size, location, and number of bedrooms. + - Sales forecasting based on advertising spend and market trends. + - Estimating the impact of temperature on electricity consumption. + - Stock price prediction based on historical data and market indicators. + - Medical risk prediction based on patient health metrics. + - Minimizing the error + - Ordinary Least Squares (OLS) + - minimizes the sum of squared differences between observed and predicted values.
+ - Residuals = Observed - Predicted + - Cost Function (SSE) = Σ(Residuals)² + - Hypothesis in Linear Regression + - represents the predicted output based on input features and model parameters. + - Single linear Regression Hypothesis: + - h(X) = β0 + β1X1 + - h(X) = predicted value + - β0 = intercept + - β1 = coefficient for the feature + - X1 = input feature + - Multiple linear Regression Hypothesis: + - h(X) = β0 + β1X1 + β2X2 + ... + βnXn + - h(X) = predicted value + - β0 = intercept + - β1, β2, ..., βn = coefficients for each feature + - X1, X2, ..., Xn = input features + - Cost Function + - Mean Squared Error (MSE) + - MSE = (1/n) * Σ(predicted - actual)² + - Root Mean Squared Error (RMSE) + - RMSE = sqrt((1/n) * Σ(predicted - actual)²) + - Mean Absolute Error (MAE) + - MAE = (1/n) * Σ|predicted - actual| + - Gradient Descent + - find best fit line for the data + - iterative optimization algorithm + - minimizes the cost function by updating model parameters in the direction of the steepest descent. + - minimize the prediction error + - start with random model parameters and + - repeatedly adjust them to reduce the difference between predicted and actual values. 
+ - cost function = MSE = (1/n) * Σ(predicted - actual)² + - Gradient computation: + - for MSE cost function, the gradient with respect to βj is: + - (∂/∂βj) * MSE = (2/n) * Σ(predicted - actual) * Xj + - Xj = input feature corresponding to βj + - update rule: + - βj = βj - α * (∂/∂βj) * Cost Function + - βj = model parameter (coefficient) + - α = learning rate (step size) + - (∂/∂βj) * Cost Function = gradient of the cost function with respect to βj + - Gradient Descent Variants for Linear Regression + - Batch Gradient Descent + - Stochastic Gradient Descent (SGD) + - Mini-Batch Gradient Descent + - Evaluation Metrics for Linear Regression + - R-squared (Coefficient of Determination) + - indicates how much variation the developed model can explain or capture + - R² = 1 - (SS_res / SS_tot) + - SS_res = Σ(actual - predicted)² + - SS_tot = Σ(actual - mean(actual))² + - Residual Standard Error (RSE) + - measures the average amount that the observed values deviate from the predicted values. + - RSE = sqrt(SS_res / (n - p - 1)) + - SS_res = Σ(actual - predicted)² + - n = number of observations + - p = number of predictors + - Residual sum of squares (RSS) + - RSS = Σ(actual - predicted)² + - Total sum of squares (TSS) + - TSS = Σ(actual - mean(actual))² + - Adjusted R-squared Error + - Adjusted R² = 1 - [(1 - R²)(n - 1) / (n - p - 1)] + - n = number of observations + - p = number of predictors + - Mean Absolute Error (MAE) + - MAE = (1/n) * Σ|predicted - actual| + - Mean Squared Error (MSE) + - average of the squared differences between the actual and predicted values for all the data + points + - gives higher weight to larger errors. + - MSE = (1/n) * Σ(predicted - actual)² + - Root Mean Squared Error (RMSE) + - RMSE = sqrt((1/n) * Σ(predicted - actual)²) + - Logistic Regression + - Used when the output is a "yes or no" type answer + - helps in predicting categories like pass/fail or spam/not spam. + - predicts binary outcomes (0 or 1) based on input features. 
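The gradient computation and update rule above, applied to simple linear regression on synthetic data (the learning rate and iteration count are arbitrary choices for this toy problem):

```python
# Fit y = b0 + b1*x by batch gradient descent on the MSE cost.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]   # generated from y = 1 + 2x
n = len(xs)

b0, b1 = 0.0, 0.0                  # start from zero parameters
alpha = 0.05                       # learning rate (step size)

for _ in range(5000):
    preds = [b0 + b1 * x for x in xs]
    # Gradients of MSE: (2/n) * Σ(predicted - actual) * x_j
    g0 = (2 / n) * sum(p - y for p, y in zip(preds, ys))
    g1 = (2 / n) * sum((p - y) * x for p, y, x in zip(preds, ys, xs))
    # Update rule: βj = βj - α * gradient
    b0 -= alpha * g0
    b1 -= alpha * g1

print(round(b0, 3), round(b1, 3))  # approaches intercept 1 and slope 2
```

This is batch gradient descent (the gradient uses every sample each step); SGD would update on one sample at a time, and mini-batch on small random subsets.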
+ - used for classification problems + - uses the logistic function (sigmoid function) to model the probability of the positive class. + - Logistic Function (Sigmoid Function) + - S curve + - S(t) = 1 / (1 + e^(-t)) + - S(t) = predicted probability of the positive class + - e = Euler's number (approximately 2.71828) + - t = linear combination of input features and model parameters + - Likelihood function for Logistic Regression + - L(β) = Π P(y_i | X_i; β) + - P(y_i | X_i; β) = S(X_i * β) if y_i = 1 + - P(y_i | X_i; β) = 1 - S(X_i * β) if y_i = 0 + - y_i = actual label (0 or 1) + - X_i = input features for the i-th instance + - β = model parameters (coefficients) + - Log likelihood function + - LL(β) = Σ [y_i * log(S(X_i * β)) + (1 - y_i) * log(1 - S(X_i * β))] + - Gradient of the Log Likelihood + - ∂LL(β) / ∂β = Σ (y_i - S(X_i * β)) * X_i + - Odds and Log-Odds + - Odds = P / (1 - P) + - P = probability of the positive class + - Log-Odds (Logit) = log(P / (1 - P)) + - Cost Function + - Logistic Regression uses the Log Loss (Cross-Entropy Loss) as its cost function. + - Log Loss(cost) = -[y * log(p) + (1 - y) * log(1 - p)] + - y = actual label (0 or 1) + - p = predicted probability of the positive class + - Model Training + - Maximum Likelihood Estimation (MLE) + - finds the model parameters that maximize the likelihood of observing the given data. + - Types of Logistic Regression + - Binary Logistic Regression + - Two classes (0 and 1) + - e.g. spam detection, disease diagnosis etc. + - Multinomial Logistic Regression + - more than two classes without any order + - e.g. classifying types of fruits (apple, banana, orange) + - Ordinal Logistic Regression + - more than two classes with a specific order + - e.g. rating scales (poor, average, good, excellent) + - Polynomial Regression + - Decision Trees + - supervised learning algorithm used for both classification and regression tasks + - tree-like model of decisions and their possible consequences. 
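The sigmoid and the log-loss (cross-entropy) cost for logistic regression, as given above, are short enough to sketch directly:

```python
# Sketch of the logistic (sigmoid) function and the log-loss cost.
import math

def sigmoid(t):
    # S(t) = 1 / (1 + e^(-t)); squashes any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-t))

def log_loss(y, p):
    # -[y*log(p) + (1-y)*log(1-p)]; small when p agrees with the label y
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))
```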
+ - splits data into subsets based on feature values + - recursively partitions the data to create a tree structure. + - like a flowchart to help make decisions based on input features. + - Components + - Root Node: topmost node representing the entire dataset + - Branches: attribute values/outcomes of a test + - Internal Nodes: attribute tests + - Leaf Nodes: final decisions or predictions (class labels or continuous values) + - Applications + - Customer Segmentation + - Fraud Detection + - Medical Diagnosis + - Loan Approval + - Splitting Criteria + - Gini Index/Impurity + - measures how often a randomly chosen element would be incorrectly labeled + - if it was randomly labeled according to the distribution of labels in the subset. + - i.e. the attribute with a lower Gini index should be preferred + - Gini = 1 - Σ(p_i)² + - p_i = proportion of instances belonging to class i + - Information Gain + - it tells how useful a question (or feature) is for splitting data into groups. + - measures how much the uncertainty decreases after the split. 
+ - Information Gain = Entropy(Parent) - Weighted Average Entropy(Children) + - Gain(G, A) = Entropy(G) - Σ (|G_v| / |G|) * Entropy(G_v) + - G = dataset before the split + - A = feature used for splitting + - G_v = subset of G where feature A has value v + - Mean Squared Error (MSE) for regression tasks + - Decision Tree Algorithms + - ID3 (Iterative Dichotomiser 3) + - greedily chooses the feature that maximizes the information gain at each node + - uses entropy as the splitting criterion + - Entropy measures impurity in the dataset + - Entropy(S) = - Σ p_i * log2(p_i) + - p_i = proportion of instances belonging to class i + - Information Gain + - IG(S, A) = Entropy(S) - Σ (|S_v| / |S|) * Entropy(S_v) + - S = dataset before the split + - A = feature used for splitting + - S_v = subset of S where feature A has value v + - it recursively splits the dataset using the feature with the highest information gain + - until all examples in a node belong to the same class or no features remain to split. + - prone to overfitting + - C4.5 + - uses a modified version of information gain called the gain ratio + - to reduce the bias towards features with many values + - Gain Ratio = Information Gain / Split Information + - Split Information = - Σ (|S_v| / |S|) * log2(|S_v| / |S|) + - handles both continuous and categorical features + - manages missing values effectively + - prunes the tree after creation to reduce overfitting + - struggles with large, high-dimensional, and noisy datasets. + - CART (Classification and Regression Trees) + - for both classification and regression tasks + - uses Gini impurity as the splitting criterion for classification tasks + - which measures the impurity (likelihood of incorrect classification) of a dataset. + - Gini Impurity = 1 - Σ(p_i)² + - p_i = proportion of instances belonging to class i + - for regression tasks, it uses Mean Squared Error (MSE) to minimize the variance within each node. 
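The splitting criteria above (entropy, Gini impurity, information gain) might be computed like this; the label lists used for checking are toy examples:

```python
# Sketch of decision-tree splitting criteria computed from class-label lists.
import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions p_i
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # Gini = 1 - sum(p_i^2); 0 means the subset is pure
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    # Entropy(parent) minus the size-weighted average entropy of the children
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted
```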
+ - produces binary trees, where each internal node has exactly two children. + - employs cost-complexity pruning after tree construction to avoid overfitting + - handles both numerical and categorical data effectively. + - CHAID (Chi-squared Automatic Interaction Detector) + - uses chi-square tests to determine the best splits, especially for categorical variables. + - chi-square tests + - test whether two categorical variables are related. + - used to determine whether observed frequencies differ significantly from expected frequencies + - under a given hypothesis. + - Applications + - feature selection in machine learning + - goodness of fit testing + - independence testing + - A/B testing and feature evaluation + - Database Query Optimization + - test if actual row counts per partition match the expected uniform distribution. + - Uneven distribution (χ² significance) suggests a poor sharding strategy. + - NLP + - Evaluate word frequency distributions in texts. + - recursively splits the data based on the feature that shows the most significant association with + the target variable. + - can create multi-way splits, allowing nodes to have more than two children. + - handles both categorical and continuous features + - Chi-square statistic (χ²) + - χ² = Σ (O_i - E_i)² / E_i + - O_i = observed frequency for category i + - E_i = expected frequency for category i + - Classification: + - assigns a class label to new data points by following the tree from the root to a leaf node + - the leaf node’s class label is assigned to the data point. + - Regression + - predicts the target variable by averaging the values at the leaf node. 
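The chi-square statistic χ² = Σ (O_i - E_i)² / E_i above can be sketched directly; the shard counts below are an illustrative assumption echoing the query-optimization example:

```python
# Sketch of the chi-square statistic for a goodness-of-fit check
# (observed counts vs. a uniform expectation).

def chi_square(observed, expected):
    # χ² = Σ (O_i - E_i)² / E_i over all categories
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# e.g. 100 rows spread over 4 shards; uniform expectation is 25 per shard
stat = chi_square([30, 20, 26, 24], [25, 25, 25, 25])
# a large statistic, judged against the χ² distribution with
# len(observed) - 1 degrees of freedom, suggests a non-uniform distribution
```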
+ - MARS (Multivariate Adaptive Regression Splines) + - extension of the CART algorithm + - uses splines to model non-linear relationships between variables + - constructs a piecewise linear model in which the relationship between the input and output variables is linear, + - but with slopes that change at certain points, known as knots. + - automatically selects and positions these knots based on the data distribution and the need to + capture non-linearities. + - Basis Functions + - h(x) = max(0, x - t) or h(x) = max(0, t - x) + - t = knot location + - Knots + - points where the slope of the piecewise linear function changes. + - Conditional Inference Trees + - uses statistical tests to choose splits based on the relationship between features and the target + variable. + - uses permutation tests to select the feature that best splits the data while minimizing bias. + - follows a recursive approach + - At each node it evaluates the statistical significance of potential splits + - using tests like the Chi-squared test for categorical features and + - the F-test for continuous features. + - Types + - Classification Trees + - Regression Trees + - Pruning Techniques + - Ensemble Methods + - Bagging + - Boosting + - Stacking + - Support Vector Machines (SVM) + - used for both classification and regression tasks + - finds the optimal hyperplane that separates different classes in the feature space. + - aims to maximize the margin between the hyperplane and the nearest data points from each class + - Support Vectors + - data points closest to the hyperplane + - influence the position and orientation of the hyperplane. + - Hyperplane + - decision boundary that separates different classes in the feature space. + - Margin + - distance between the hyperplane and the nearest data points from each class. + - Objective Function + - maximize the margin while minimizing classification errors. 
+ - Optimization Problem + - Minimize: (1/2) * ||w||² + C * Σ ξ_i + - w = weight vector defining the hyperplane + - C = regularization parameter balancing margin maximization and error minimization + - ξ_i = slack variables allowing for misclassification of data points + - Subject to: y_i * (w · x_i + b) ≥ 1 - ξ_i, for all i + - y_i = class label of data point i (+1 or -1) + - x_i = feature vector of data point i + - b = bias term + - Hinge Loss + - Hinge loss is used to measure the error in SVM classification. + - Hinge Loss Function + - L(y, f(x)) = max(0, 1 - y * f(x)) + - y = actual class label (+1 or -1) + - f(x) = predicted value from the SVM model + - Lagrange Multipliers + - used to solve constrained optimization problems in SVMs. + - transform the constrained optimization problem into an unconstrained one + - by introducing additional variables (Lagrange multipliers) for each constraint. + - L(w, b, α) = (1/2) * ||w||² - Σ α_i [y_i * (w · x_i + b) - 1] + - α_i = Lagrange multiplier for data point i + - Dual problem for SVM + - Maximize: W(α) = Σ α_i - (1/2) * Σ Σ α_i * α_j * y_i * y_j * (x_i · x_j) + - Subject to: Σ α_i * y_i = 0 and α_i ≥ 0 for all i + - SVM Decision Boundary + - f(x) = w · x + b + - w = weight vector defining the hyperplane + - x = input feature vector + - b = bias term + - Use cases: + - Binary classification + - Spam vs Non-Spam + - Cat vs Dog + - Types + - Linear SVM + - uses a linear boundary to separate classes + - Non-linear SVM + - uses Kernel functions to transform data into higher dimensions + - Kernel Functions + - Kernel: a function that transforms the input data into a higher-dimensional space + - allows SVM to find a linear separating hyperplane in that space. 
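The hinge loss L(y, f(x)) = max(0, 1 - y * f(x)) defined above is a one-liner; the sample labels and scores used for checking are illustrative:

```python
# Sketch of the SVM hinge loss; labels are +1/-1 and f(x) is the
# signed decision value w·x + b.

def hinge_loss(y, fx):
    # zero when the point is correctly classified outside the margin,
    # positive when it is inside the margin or misclassified
    return max(0.0, 1.0 - y * fx)
```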
+ - Common Kernel Functions + - Linear Kernel + - for linearly separable data + - Polynomial Kernel + - for non-linear data + - Radial Basis Function (RBF) Kernel + - for complex non-linear data + - transforms data into space based on distance from a center point + - ideal for spherical or circular relationships + - Sigmoid Kernel + - for certain types of non-linear data + - mimics neural network behavior using sigmoid activation function + - K-NN (K-Nearest Neighbors) + - Find closest data points(neighbours) to make predictions + - used for classification and regression tasks + - find k nearest data points in the training set and + - make predictions based on majority class (classification) or average value (regression) of those + neighbors. + - Also called Lazy Learner Algorithm + - because it does not learn a model during training phase + - simply stores the training data and makes predictions during the testing phase. + - based on proximity and majority voting (for classification) or averaging (for regression). + - Boundary + - shape of the decision boundary is determined by the distribution of training data points. + - Voronoi Diagram + - partitioning of a space into regions based on distance to a specific set of points. + - each region contains all points closer to one particular point than to any other. + - helps visualize how K-NN classifies new data points based on proximity to training data. + - What is k? + - number of nearest neighbors to consider when making predictions. + - How to choose k? + - Experiment with different k values using cross-validation + - choose the one that gives the best performance on validation data. + - if data has lot of outliers, choose larger k to smooth out predictions. + - Methods to choose k + - Cross-Validation + - k-fold cross-validation + - dividing the dataset into k parts + - k value chosen is the one that gives the highest average accuracy during these tests. 
+ - Elbow Method + - Draw/plot accuracy vs different k values + - look for the "elbow" point where accuracy starts to level off + - choose k at this point to balance accuracy and complexity. + - Steps to implement K-NN + - Choose the number of neighbors (k) + - Calculate the distance between the new data point and all training data points + - Sort the distances and select the k nearest neighbors + - For classification: use majority voting to determine the class label + - For regression: calculate the average of the k nearest neighbors' values + - Distance Metrics: distance calculation to identify nearest neighbors + - Euclidean Distance + - shortest path between two points in multi-dimensional space + - d(p, q) = sqrt(Σ (p_i - q_i)²) + - p and q are two points + - p_i and q_i are the coordinates of points p and q in each dimension + - Manhattan Distance + - total absolute difference between coordinates of two points + - d(p, q) = Σ |p_i - q_i| + - Also called Taxi-cab distance or L1 norm + - total distance you would travel if you could only move along horizontal and vertical lines like a + grid or city streets. + - Minkowski Distance + - family of distance metrics that includes both Euclidean and Manhattan distances as special cases. + - d(p, q) = (Σ |p_i - q_i|^r)^(1/r) + - r = order of the norm + - when r = 2, it becomes Euclidean distance + - when r = 1, it becomes Manhattan distance + - Hamming Distance + - number of positions at which corresponding elements of two equal-length sequences differ + - useful for categorical or binary features + - Choosing the value of k + - small k: sensitive to noise, captures local patterns + - large k: smooths out predictions, captures global patterns + - Weighted K-NN + - closer neighbors have more influence on the prediction + - Applications of K-NN + - Image Recognition + - Recommender Systems + - Movie recommendations + - Product recommendations + - Medical Diagnosis + - Fraud Detection + - Spam detection + - comparing new emails to known spam and non-spam examples. 
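The K-NN steps and distance metrics above can be sketched from scratch; the 2-D points and labels are toy assumptions:

```python
# Sketch of K-NN classification: compute distances, take the k nearest,
# then majority-vote over their labels.
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def manhattan(p, q):
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def knn_classify(points, query, k=3, dist=euclidean):
    """points: list of (point, label). Majority vote among the k nearest."""
    # steps 2-3: compute distances, sort, keep the k nearest neighbours
    neighbours = sorted(points, key=lambda pl: dist(pl[0], query))[:k]
    # step 4: majority voting over the neighbours' labels
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

points = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"),
          ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]
```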
+ - Customer Segmentation + - Customer behavior analysis + - Anomaly Detection + - Speech Recognition + - Match spoken words to known patterns to convert them into text. + - Impact of K on decision boundary + - small k: complex, wiggly decision boundary + - large k: smoother, simpler decision boundary + - Naive Bayes + - predicts the category of a data point using probability + - based on Bayes' Theorem + - formula in probability theory used to update the probability of a hypothesis (A) based on new evidence (B) + - Formula: + - P(A|B) = (P(B|A) * P(A)) / P(B) + - P(A|B) = posterior probability of hypothesis A given evidence B + - P(B|A) = likelihood of evidence B given hypothesis A + - P(A) = prior probability of hypothesis A + - P(B) = prior probability of evidence B + - Assumes independence between features + - Used for classification tasks + - Used mostly in high-dimensional text classification + - Naive assumption + - assumes that all features are independent of each other given the class label. + - simplifies the computation of probabilities + - allows the model to make predictions efficiently even with high-dimensional data. + - Assumptions + - Feature Independence + - assumes that all features are independent of each other given the class label. + - Class Conditional Independence + - assumes that the presence or absence of a feature is independent of the presence or absence of any + other feature, given the class label. + - Discrete Features have multinomial distribution + - assumes that features follow a multinomial distribution, especially in text classification tasks. + - Features are equally important + - assumes that all features contribute equally to the final prediction. + - No missing data + - assumes that there are no missing values in the dataset. + - Naive Bayes for continuous features + - assumes that continuous features follow a Gaussian (normal) distribution within each class. 
+ - P(x|C) = (1 / sqrt(2πσ_C²)) * exp(- (x - μ_C)² / (2σ_C²)) + - P(x|C) = probability of feature x given class C + - μ_C = mean of feature x for class C + - σ_C² = variance of feature x for class C + - Types of Naive Bayes Classifiers + - Gaussian Naive Bayes + - continuous features + - assumes that features follow a normal (Gaussian) distribution within each class + - i.e. a bell curve symmetric about the mean of the feature + - it assumes the likelihood follows a Gaussian distribution + - P(x|C) = (1 / sqrt(2πσ_C²)) * exp(- (x - μ_C)² / (2σ_C²)) + - P(x|C) = probability of feature x given class C + - μ_C = mean of feature x for class C + - σ_C² = variance of feature x for class C + - Multinomial Naive Bayes + - for discrete data and text classification problems + - Used to represent the frequencies of words or events in documents. + - assumes that features represent counts or frequencies of events. + - assumes each word or feature is independent of the others + - commonly used for text classification tasks like spam detection and sentiment analysis. + - [Multinomial.png](Multinomial.png) + - Maximum Likelihood Estimation (MLE) + - estimates the probabilities of features given the class labels based on their observed + frequencies in the training data. + - [MLE](MLE.png) + - Laplace Smoothing + - technique used to handle zero probabilities in Multinomial Naive Bayes. + - adds a small constant (usually 1) to the count of each feature + - ensures that no feature has a probability of zero. + - P(w|C) = (count(w, C) + 1) / (Σ count(w', C) + V) + - P(w|C) = probability of word w given class C + - count(w, C) = count of word w in documents of class C + - V = total number of unique words in the vocabulary + - Bernoulli Naive Bayes + - deals with binary/boolean features + - assumes that features are binary (0 or 1), indicating the presence or absence of an event. 
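Multinomial Naive Bayes with Laplace smoothing, as described above, might look like this in miniature; the tiny spam/ham corpus is an illustrative assumption:

```python
# Hedged sketch of Multinomial Naive Bayes over word counts, with
# Laplace (+1) smoothing; prediction compares log P(C) + sum log P(w|C).
import math
from collections import Counter

def train(docs):
    """docs: list of (word_list, label). Returns (priors, counts, vocab)."""
    priors, counts, vocab = Counter(), {}, set()
    for words, label in docs:
        priors[label] += 1
        counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return priors, counts, vocab

def predict(words, priors, counts, vocab):
    total_docs = sum(priors.values())
    best, best_lp = None, -math.inf
    for label in priors:
        total = sum(counts[label].values())
        lp = math.log(priors[label] / total_docs)  # log prior P(C)
        for w in words:
            # Laplace smoothing: (count(w, C) + 1) / (total + V)
            lp += math.log((counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

docs = [(["win", "money", "now"], "spam"),
        (["meeting", "tomorrow"], "ham"),
        (["win", "prize"], "spam"),
        (["project", "meeting", "notes"], "ham")]
priors, counts, vocab = train(docs)
```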
+ - commonly used for document classification tasks where the presence or absence of words is + important + - for classification of binary features such as 'Yes' or 'No', '1' or '0', 'True' or 'False' etc + - P(x|C) = P(w|C)^x * (1 - P(w|C))^(1 - x) + - P(x|C) = probability of feature x given class C + - P(w|C) = probability of word w given class C + - x = binary feature (0 or 1) + - Bernoulli Distribution + - models the probability of a binary outcome (success/failure) for a single trial. + - P(X = k) = p^k * (1 - p)^(1 - k) + - k = outcome (0 or 1) + - p = probability of success (1) + - Complement Naive Bayes + - variant of Multinomial Naive Bayes + - designed to handle imbalanced datasets better. + - reduce the bias towards the majority class + - focuses on the complement of each class + - estimates probabilities based on instances not belonging to the class. + - often used in text classification tasks with imbalanced class distributions. + - Formula + - P(w|C) = (count(w, ¬C) + 1) / (Σ count(w', ¬C) + V) + - P(w|C) = probability of word w given class C + - count(w, ¬C) = count of word w in documents not belonging to class C + - V = total number of unique words in the vocabulary + - e.g. CNB estimates probabilities for Apples using Bananas' data (complement) and vice versa - Applications - - feature selection - - goodness of fit testing - - independence testing - - A/B testing and feature evaluation - - Feature selection in machine learning - - Database Query Optimization - - test if actual row counts per partition match the expected uniform distribution. - - Uneven distribution (χ² significance) suggests a poor sharding strategy. - - NLP - - Evaluate word frequency distributions in texts. - - recursively splits the data based on the feature that shows the most significant association with the target variable. - - can create multi-way splits, allowing nodes to have more than two children. 
- - handles both categorical and continuous features - - Chi-square statistic (χ²) - - χ² = Σ (O_i - E_i)² / E_i - - O_i = observed frequency for category i - - E_i = expected frequency for category i - - Classification: - - assign a class label to new data points by following the tree from the root to a leaf node - - with leaf node’s class label being assigned to data. - - Regression - - predicts the target variable by averaging the values at the leaf node. - - MARS (Multivariate Adaptive Regression Splines) - - extension of the CART algorithm - - uses splines to model non-linear relationships between variables - - constructs a piecewise linear model where the relationship between the input and output variables - - but with variable slopes at different points, known as knots. - - automatically selects and positions these knots based on the data distribution and the need to capture non-linearities. - - Basis Functions - - h(x) = max(0, x - t) or h(x) = max(0, t - x) - - t = knot location - - Knot function - - points where the slope of the piecewise linear function changes. - - Conditional Inference Trees - - uses statistical tests to choose splits based on the relationship between features and the target variable. - - use permutation tests to select the feature that best splits the data while minimizing bias. - - follows recursive approach - - At each node it evaluates the statistical significance of potential splits - - using tests like the Chi-squared test for categorical features and - - the F-test for continuous features. - - Types - - Classification Trees - - Regression Trees - - Pruning Techniques - - Ensemble Methods - - Bagging - - Boosting - - Stacking - - Support Vector Machines (SVM) - - used for both classification and regression tasks - - finds the optimal hyperplane that separates different classes in the feature space. 
- - aims to maximize the margin between the hyperplane and the nearest data points from each class - - Support Vectors - - data points closest to the hyperplane - - influence the position and orientation of the hyperplane. - - Hyperplane - - decision boundary that separates different classes in the feature space. - - Margin - - distance between the hyperplane and the nearest data points from each class. - - Types - - Linear SVM - - Non-linear SVM - - Kernel Functions - - Neural Networks - - Feedforward Neural Networks - - Convolutional Neural Networks (CNN) - - Recurrent Neural Networks (RNN) - - Tools and Frameworks - - Programming Languages - - Python - - R - - Julia - - Libraries - - Scikit-learn - - TensorFlow - - PyTorch - - Platforms - - Google Colab - - Jupyter Notebooks - - Azure Machine Learning Studio - - Applications/Use cases - - Image Recognition - - Natural Language Processing (NLP) - - Recommendation Systems - - Fraud Detection - - Predictive Analytics - - Overfitting - - when a model learns the training data too well, - - capturing noise and details that do not generalize to new data. - - underfitting - - when a model is too simple to capture the underlying patterns in the data, - - resulting in poor performance on both training and new data. + - Spam Detection + - Sentiment Analysis + - Document/Text Classification/Categorization + - Credit Scoring + - Medical Diagnosis + - Weather Prediction + - Random Forest + - uses decision trees as base learners + - each tree looks at a random subset of data and + - the results are combined by voting for classification or averaging for regression + - to make a final prediction. + - ensemble learning method + - Method that combines multiple models to improve overall performance. + - combines multiple decision trees(forest) to improve predictive performance and reduce overfitting. + - builds multiple decision trees during training + - merges their predictions to make a final prediction. 
+ - Ensemble Methods + - Bagging (Bootstrap Aggregating) + - models are trained independently in parallel + - results are combined to make a final prediction. + - creates multiple subsets of the training data by sampling with replacement + - trains a decision tree on each subset + - combines their predictions through majority voting (classification) or averaging (regression). + - Bagging Algorithms + - Bootstrap Sampling + - randomly samples the training data with replacement to create N subsets. + - each subset is used to train a separate decision tree. + - Base Model Training + - trains a decision tree on each bootstrap sample independently in parallel. + - Prediction Aggregation + - combines predictions from multiple models to make a final prediction. + - majority voting for classification tasks + - averaging for regression tasks. + - Out of Bag Evaluation + - estimates model performance using data points not included in the bootstrap samples for + each tree. + - provides an unbiased estimate of the model's accuracy without needing a separate + validation set. + - Boosting + - train models one after another + - each model tries to correct the errors of the previous one. + - combines their predictions to make a final prediction. + - Algorithms + - AdaBoost (Adaptive Boosting) + - sequentially trains weak learners (usually decision stumps) + - each learner focuses more on the misclassified instances from the previous learners. + - combines their predictions through weighted voting (classification) or weighted + averaging (regression). + - Weight Update + - assigns equal weights to all training samples initially + - increases the weights of misclassified instances + - decreases the weights of correctly classified instances + - so that subsequent learners focus more on the difficult cases. 
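The bagging workflow above (bootstrap sampling, independent training, vote aggregation) can be sketched with a stub base learner; a real implementation would train a decision tree on each sample:

```python
# Hedged sketch of bagging: bootstrap sampling with replacement plus
# majority-vote aggregation. The base "model" here is a stub that just
# predicts the majority label of its own bootstrap sample; it stands in
# for training a decision tree on each sample.
import random
from collections import Counter

def bootstrap_sample(data, rng):
    # sample len(data) points WITH replacement (some repeat, some are left out)
    return [rng.choice(data) for _ in data]

def bagging_predict(data, n_models=15, seed=0):
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        sample = bootstrap_sample(data, rng)
        # stub base learner: majority label of its bootstrap sample
        votes.append(Counter(sample).most_common(1)[0][0])
    # aggregate the per-model predictions by majority voting
    return Counter(votes).most_common(1)[0][0]

labels = ["spam"] * 7 + ["ham"] * 3
majority = bagging_predict(labels)
```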
+ - but it is sensitive to noisy data and outliers + - Gradient Boosting + - builds models sequentially + - each new model tries to correct the residual errors of the combined previous models. + - uses gradient descent to minimize a specified loss function. + - Loss Function Optimization + - minimizes a specified loss function (e.g., Mean Squared Error for regression, Log Loss + for classification) + - using gradient descent to optimize model predictions. + - XGBoost (Extreme Gradient Boosting) + - optimized implementation of gradient boosting + - designed for speed and performance + - includes regularization techniques to prevent overfitting. + - Regularization + - includes L1 (Lasso) and L2 (Ridge) regularization terms in the objective function + - helps prevent overfitting by penalizing complex models. + - LightGBM (Light Gradient Boosting Machine) + - gradient boosting framework that uses tree-based learning algorithms + - designed for efficiency and scalability + - uses histogram-based algorithms to speed up training and reduce memory usage. + - Histogram-based Splitting + - uses histogram-based algorithms to find optimal split points + - speeds up training and reduces memory usage. + - CatBoost (Categorical Boosting) + - gradient boosting library that handles categorical features natively + - employs symmetric decision trees and + - a unique encoding method that considers target values + - reduces the need for extensive data preprocessing. + - Categorical Feature Handling + - natively handles categorical features + - reduces the need for extensive data preprocessing. + - Stacking(Stacked Generalization) + - combines predictions from multiple models (base learners) + - uses a meta-model to learn how to best combine these predictions for improved accuracy. + - Meta Model + - model that learns to combine predictions from base learners. 
+ - Feature Randomness + - at each split in a decision tree, a random subset of features is selected + - helps create diverse trees and reduces correlation between them. + - Out-of-Bag (OOB) Error Estimation + - estimates model performance using data points not included in the bootstrap samples for each tree. + - provides an unbiased estimate of the model's accuracy without needing a separate validation set. + - Feature Importance + - measures the contribution of each feature to the model's predictions + - helps identify the most relevant features for the target variable. + - Use Cases + - Classification Tasks + - Regression Tasks + - Anomaly Detection + - Feature Selection + - Applications + - Credit Scoring + - Fraud Detection + - Medical Diagnosis + - Stock Market Prediction + - Out-of-Bag (OOB) Scoring + - method to estimate the performance of Random Forest models + - Out-of-Bag (OOB) Error + - estimate of the model's prediction error + - calculated using data points not included in the bootstrap samples for each tree. 
+ - Hyperparameters + - Number of Trees (n_estimators) + - Default: 100 + - Maximum Depth of Trees (max_depth) + - Default: None (nodes are expanded until all leaves are pure or contain less than min_samples_split + samples) + - Minimum Samples to Split a Node (min_samples_split) + - Minimum number of samples required to split an internal node + - Default: 2 + - Minimum Samples per Leaf (min_samples_leaf) + - Default: 1 + - Number of Features to Consider at Each Split (max_features) + - Default max features: + - for classification: sqrt(total number of features) + - for regression: total number of features / 3 + - Max Leaf Nodes (max_leaf_nodes) + - Default: None (unlimited) + - Max Samples (max_samples) + - Default: None (use all samples) + - Hyperparameter Tuning + - Grid Search + - Random Search + - Bayesian Optimization + - Neural Networks + - Feedforward Neural Networks + - Convolutional Neural Networks (CNN) + - Recurrent Neural Networks (RNN) + - Tools and Frameworks + - Programming Languages + - Python + - R + - Julia + - Libraries + - Scikit-learn + - TensorFlow + - PyTorch + - Platforms + - Google Colab + - Jupyter Notebooks + - Azure Machine Learning Studio + - Applications/Use cases + - Image Recognition + - Natural Language Processing (NLP) + - Recommendation Systems + - Fraud Detection + - Predictive Analytics + - Overfitting + - when a model learns the training data too well, + - capturing noise and details that do not generalize to new data. + - underfitting + - when a model is too simple to capture the underlying patterns in the data, + - resulting in poor performance on both training and new data. - Supervised Learning - - Classification - - goal is to predict discrete labels or categories - - e.g. spam detection, image recognition, sentiment analysis etc. - - Regression - - aim to predict continuous numerical values - - e.g. house price prediction, stock price forecasting, temperature prediction etc. 
- - Algorithms - - Linear Regression - - predict numbers using a straight line. - - helps find the relationship between input and output. - - uses the equation of a line: Y = mX + b - - Y = predicted value (output) - - m = slope of the line (coefficient) - - X = input feature - - b = y-intercept (constant term) - - Logistic Regression - - Decision Trees - - Random Forest - - Support Vector Machines (SVM) - - Neural Networks + - Classification + - goal is to predict discrete labels or categories + - e.g. spam detection, image recognition, sentiment analysis etc. + - Regression + - aim to predict continuous numerical values + - e.g. house price prediction, stock price forecasting, temperature prediction etc. + - Algorithms + - Linear Regression + - predict numbers using a straight line. + - helps find the relationship between input and output. + - uses the equation of a line: Y = mX + b + - Y = predicted value (output) + - m = slope of the line (coefficient) + - X = input feature + - b = y-intercept (constant term) + - Logistic Regression + - Decision Trees + - Random Forest + - Support Vector Machines (SVM) + - Neural Networks - Unsupervised Learning - - Clustering - - Dimensionality Reduction + - Clustering + - groups similar data points together into clusters + - based on their characteristics, without using any labeled data + - discovery of natural groupings and hidden patterns in complex datasets. + - Similarity Measures: + - Euclidean distance, + - Cosine Similarity + - Manhattan Distance + - Use cases + - Customer Segmentation: group customers with similar shopping habits + - Anomaly Detection: Identifying outliers or fraudulent activities in finance, network security and sensor + data. + - Image Segmentation: Dividing images into meaningful parts for object detection, medical diagnostics or + computer vision tasks. + - Recommendation Systems: Clustering user preferences to recommend movies, products or content tailored to + different groups. 
+ - Market Basket Analysis: Discovering products frequently bought together to optimize store layouts and + promotions. + - Types of clustering + - Hard Clustering + - each data point strictly belongs to exactly one cluster + - no overlap is allowed + - Use cases: + - Market segmentation, + - customer grouping, + - document clustering. + - Soft Clustering + - assigns each data point a probability or degree of membership to multiple clusters simultaneously + - allowing data points to partially belong to several groups. + - e.g. + - A data point may have a 70% membership in Cluster 1 and 30% in Cluster 2, + - reflecting uncertainty or overlap in group characteristics. + - Clustering Methods + - Centroid-based Clustering (Partitioning Methods) + - organizes data points around central prototypes called centroids + - each cluster is represented by the mean (or medoid) of its members + - number of clusters is specified in advance + - algorithm allocates points to the nearest centroid + - Algos + - K-Means clustering + - Iteratively assigns points to nearest centroid + - recalculates centroids to minimize intra-cluster variance + - Helps identify natural groupings in unlabeled datasets + - used in customer segmentation, image compression, and pattern discovery + - working + - k represents the number of groups or clusters we want to classify our items into. + - choose initial centroids randomly from the data points + - assign each data point to the nearest centroid to form clusters + - recalculate the centroids by taking the mean of all data points in each cluster + - repeat the assignment and update steps until the centroids do not change significantly or + a maximum number of iterations is reached. + - elbow method to determine optimal k + - plot the within-cluster sum of squares (WCSS) against different values of k + - look for the "elbow" point where the rate of decrease in WCSS slows down + - choose k at this point to balance cluster compactness and model simplicity. 
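The K-Means loop and the elbow method above can be sketched with scikit-learn (listed under Libraries earlier); the synthetic blob dataset and the k range 1–6 are illustrative assumptions, not values from these notes:

```python
# Hedged sketch of K-Means and the elbow method with scikit-learn.
# The blobs (3 centers, 150 points) are made-up illustration data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# Compute WCSS (scikit-learn exposes it as inertia_) for k = 1..6;
# the "elbow" is the k where the rate of decrease slows down.
wcss = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)
```

Plotting `wcss` against k (e.g. with matplotlib) shows a sharp drop up to the elbow and a flat tail afterwards; here the drop levels off around k = 3.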
+ - Distortion (WCSS) + - measures the average squared distance between each data point and its assigned cluster center. + - measure of how well the clusters represent the data. + - lower distortion indicates that data points are closer to their assigned centroids, suggesting better clustering. + - Distortion = (1/N) * Σ (distance of each point to its assigned centroid)², where N is the number of points + - Inertia + - sum of squared distances of each data point to its closest cluster center + - total squared error of the clustering. + - lower inertia indicates that data points are closer to their assigned centroids, suggesting better clustering. + - Inertia = Σ (distance of each point to its assigned centroid)² + - K-Means++ clustering + - improves K-Means by selecting initial centroids more strategically + - helps achieve better convergence and clustering results. + - initialization process + - randomly select the first centroid from the data points + - for each subsequent centroid, + - calculate each point's distance to its nearest existing centroid and + - select the next centroid with a probability proportional to the square of that distance. + - P(x) = D(x)² / Σ D(x)² + - P(x) = probability of selecting data point x as the next centroid + - D(x) = distance from data point x to the nearest existing centroid + - K-Modes clustering + - variant of K-Means for categorical data + - uses modes (most frequent categories) instead of means to represent cluster centers, with the number of mismatches as the distance measure. + - suitable for clustering categorical data where calculating means is not meaningful. + - working + - k represents the number of clusters we want to classify our items into.
+ - choose initial modes randomly from the data points + - assign each data point to the nearest mode based on the number of mismatches (Hamming distance) + - recalculate the modes by selecting the most frequent category for each feature in each cluster + - repeat the assignment and update steps until the modes do not change significantly or a maximum number of iterations is reached. + - Fuzzy C-Means clustering (FCM) + - [Fuzzy](Fuzzy.webp) + - allows data points to belong to multiple clusters with varying degrees of membership + - captures ambiguity and soft boundaries in data. + - particularly useful when the clusters overlap or boundaries are not clear-cut. + - working + - k represents the number of clusters we want to classify our items into. + - choose initial cluster centers randomly from the data points + - calculate the degree of membership for each data point to each cluster based on distance + - V_ij = 1 / Σ_k (d_ij / d_ik)^(2/(m-1)), summing over all cluster centers k + - V_ij = degree of membership of data point i to cluster j + - d_ij = distance from data point i to cluster center j + - m = fuzziness parameter that controls the degree of membership (m > 1) + - update the cluster centers by taking a weighted average of all data points, where weights are determined by their degree of membership + - repeat the assignment and update steps until convergence is achieved (i.e., when changes in cluster centers or membership degrees fall below a specified threshold). + - K-medoids + - uses actual data points (medoids) as centers, robust to outliers. + - Density-based Clustering (Model-based Methods) + - defines clusters as contiguous regions of high data density separated by areas of lower density + - identifies clusters of arbitrary shapes + - does not require predefining the number of clusters + - Algos + - DBSCAN (Density-Based Spatial Clustering of Applications with Noise) + - Groups points with sufficient neighbors + - labels sparse points as noise.
+ - Parameters: + - Epsilon (ε): + - radius to search for neighboring points + - determines the size of the neighborhood around each point. + - smaller ε leads to more clusters and more noise, while larger ε may merge distinct clusters together. + - a common way to determine ε is by analyzing the k-distance graph. + - MinPts: + - minimum number of points required to form a dense region + - MinPts = 3 is a common default value, but it can be adjusted based on the dataset and desired cluster density. + - a common rule of thumb is MinPts ≥ D + 1, where D is the number of dimensions in the dataset. + - Core Points: points with at least MinPts neighbors within ε + - Working + - For each point, count the number of points within ε radius. + - If the count is greater than or equal to MinPts, label it as a core point and form a cluster. + - If a point is not a core point but is within ε of a core point, label it as a border point and assign it to the same cluster. + - If a point is neither a core point nor a border point, label it as noise. + - Evaluation + - Silhouette Score: + - measures how similar a point is to its own cluster compared to other clusters. + - ranges from -1 to 1, with higher values indicating better clustering. + - 0 indicates that the point is on the boundary between two clusters, while negative values suggest that the point may have been assigned to the wrong cluster. + - Silhouette Score = (b - a) / max(a, b) + - a = average distance from the point to other points in the same cluster + - b = average distance from the point to points in the nearest cluster + - Davies-Bouldin Index: + - evaluates the average similarity between clusters, + - lower values indicate better clustering. + - Adjusted Rand Index: + - measures the similarity between two clusterings, + - adjusted for chance grouping. + - bounded above by 1, with higher values indicating better agreement between the clusterings. + - 0 indicates random (chance-level) clustering, 1 indicates perfect agreement, and negative values indicate worse-than-chance agreement.
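The DBSCAN parameters and workflow above can be sketched with scikit-learn (listed under Libraries earlier). The two-moons dataset and the values eps=0.2, min_samples=5 are illustrative assumptions chosen for this synthetic set, not general recommendations:

```python
# Hedged DBSCAN sketch: two interleaved half-moons, a shape that
# centroid-based methods struggle with; eps/min_samples are illustrative.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                      # label -1 marks noise points
n_clusters = len(set(labels) - {-1})     # clusters found, excluding noise
```

Here DBSCAN recovers the two moon-shaped clusters without being told the number of clusters in advance; shrinking eps or raising min_samples turns more points into noise.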
+ - OPTICS (Ordering Points To Identify Clustering Structure) + - Extends DBSCAN to handle varying densities + - produces an ordering of points based on their reachability distance, + - allowing for the identification of clusters at different density levels. + - Concepts + - Core Distance: + - distance from a point to its MinPts-th nearest neighbor, + - indicating how densely the point is surrounded by other points. + - Reachability Distance: + - distance from a point to another point, + - considering the core distance of the second point, + - which helps determine how reachable one point is from another. + - Reachability Plot + - [OpticCLuster](OpticCLuster.webp) + - visual representation of the reachability distances of points in the dataset. + - helps identify clusters by looking for valleys in the plot, + - where lower reachability distances indicate denser regions of data. + - Connectivity-based Clustering (Hierarchical Clustering) + - builds nested groupings of data by evaluating how data points are connected to their neighbors. + - creates a dendrogram, a tree-like structure that reflects relationships at various granularity levels + - does not require specifying cluster numbers in advance + - but can be computationally intensive + - Dendrogram + - tree-like diagram that illustrates the arrangement of clusters produced by hierarchical clustering. + - shows the relationships between clusters at different levels of similarity.
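The hierarchical clustering and dendrogram described above can be sketched with SciPy (not in the Libraries list earlier, but the usual tool for dendrograms); the six 2-D points are made-up illustration data:

```python
# Hedged sketch of agglomerative (bottom-up) hierarchical clustering.
# linkage() returns the merge history that a dendrogram plot would draw.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0, 0], [0, 1], [1, 0],          # one tight group
              [10, 10], [10, 11], [11, 10]])   # a second group far away

Z = linkage(X, method="ward")                     # Ward linkage
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
```

`scipy.cluster.hierarchy.dendrogram(Z)` would draw the tree itself, and `method` can be swapped for "single", "complete", or "average" to use a different linkage criterion.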
+ - Distance Metrics + - Single Linkage (Minimum Distance): distance between the closest points of two clusters + - Complete Linkage (Maximum Distance): distance between the farthest points of two clusters + - Average Linkage (Group Average Distance): average distance between all pairs of points in two clusters + - Ward's Method: minimizes the total within-cluster variance when merging clusters + - Algos + - Agglomerative (Bottom-up) + - [AglomerativeHierarchialClustering](AglomerativeHierarchialClustering.webp) + - Start with each point as a cluster + - iteratively merge the closest clusters. + - Divisive (Top-down) + - Start with one cluster + - iteratively split into smaller clusters. + - Affinity Propagation (AP) + - identifies exemplars among data points + - forms clusters around these exemplars based on message passing between points. + - does not require specifying the number of clusters in advance + - automatically determines the optimal number of clusters based on the data. + - Message Passing + - iterative process where data points exchange messages to determine cluster assignments. + - each point sends messages to other points indicating how well they would serve as exemplars for each other. + - through this communication, the algorithm identifies exemplars and forms clusters around them. + - Working + - Initialize messages between data points + - Iteratively update messages based on similarity and preferences + - Identify exemplars and assign clusters based on message convergence. + - Similarity Computation + - calculates the similarity between data points to determine how well they would serve as exemplars for each other. + - Similarity can be based on distance metrics or other measures of similarity relevant to the data.
+ - Similarity(i, k) = -distance(i, k) + - s(i, k) = -||x_i - x_k||² + - Similarity(i, k) = similarity between data point i and potential exemplar k + - distance(i, k) = distance between data point i and potential exemplar k + - Responsibility Update + - updates the responsibility messages that indicate how well a data point serves as an exemplar for another point. + - Responsibility(i, k) = Similarity(i, k) - max over j ≠ k of (Availability(i, j) + Similarity(i, j)) + - R(i, k) = S(i, k) - max_{j ≠ k} (A(i, j) + S(i, j)) + - Responsibility(i, k) = how well-suited point k is to serve as the exemplar for data point i + - Similarity(i, j) = similarity between data point i and competing candidate exemplar j + - Availability(i, j) = availability of competing candidate exemplar j for data point i + - Availability Update + - updates the availability messages that indicate how appropriate it would be for a data point to choose another point as its exemplar. + - Availability(i, k) = min(0, Responsibility(k, k) + Σ max(0, Responsibility(j, k))) + - A(i, k) = min(0, R(k, k) + Σ max(0, R(j, k))) for j ∉ {i, k} + - Availability(i, k) = availability of potential exemplar k for data point i + - Responsibility(k, k) = self-responsibility of potential exemplar k + - Responsibility(j, k) = responsibility of other data points j for potential exemplar k + - Iteration and Convergence + - the algorithm iteratively updates responsibility and availability messages until convergence is achieved, meaning that cluster assignments stabilize and do not change significantly with further iterations. + - r(i, i) + a(i, i) > 0 indicates that data point i is an exemplar + - Parameters + - Preference: + - controls the number of clusters by influencing the likelihood of points being chosen as exemplars. + - higher preference values lead to more clusters, while lower values result in fewer clusters. + - Damping Factor: + - helps stabilize message updates and prevent oscillations during convergence.
+ - typically set between 0.5 and 1, with higher values providing more stability at the cost of slower convergence. + - Damping Factor (λ) = 0.5 to 1 + - Without damping, the algorithm may oscillate or fail to converge. + - Distribution-based Clustering + - assumes data is generated from a mixture of probability distributions, + - such as Gaussian distributions, and assigns points to clusters based on statistical likelihood + - supports clusters with flexible shapes and overlaps + - but requires specifying the number of distributions. + - Algos + - Gaussian Mixture Model (GMM) + - probabilistic model that assumes data is generated from + - a mixture of several Gaussian distributions + - with unknown parameters. + - soft clustering method that assigns data points to clusters based on the probability of belonging to each cluster + - Fits data as a weighted mixture of Gaussian distributions + - assigns data points based on likelihood. + - probability density function of a Gaussian distribution + - P(x) = (1 / (σ * sqrt(2π))) * exp(- (x - μ)² / (2σ²)) + - P(x) = probability density of data point x + - μ = mean of the Gaussian distribution + - σ = standard deviation of the Gaussian distribution + - EM algorithm for GMM + - Expectation-Maximization (EM) algorithm is an iterative method used to find maximum likelihood estimates of parameters in statistical models, particularly for Gaussian Mixture Models (GMMs). + - E-step (Expectation step): + - calculates the expected membership probabilities (responsibilities) for each data point belonging to each Gaussian component based on current parameter estimates.
+ - γ_ik = P(z_i = k | x_i, θ) = (π_k * P(x_i | μ_k, σ_k)) / Σ (π_j * P(x_i | μ_j, σ_j)) + - γ_ik = responsibility of component k for data point i + - π_k = mixing coefficient for component k + - P(x_i | μ_k, σ_k) = probability density of data point x_i under Gaussian component + k + - M-step (Maximization step): + - updates the parameters of the Gaussian components (means, variances, and mixing + coefficients) based on the responsibilities calculated in the E-step. + - μ_k = (Σ γ_ik * x_i) / Σ γ_ik + - σ_k² = (Σ γ_ik * (x_i - μ_k)²) / Σ γ_ik + - π_k = Σ γ_ik / N + - N = total number of data points + - Covariance and weights + - covariance matrix (Σ_k) captures the shape and orientation of each Gaussian component + - mixing coefficients (π_k) represent the proportion of data points belonging to each + component. + - Posterior probabilities(Cluster Responsibilities) + - after fitting the GMM, we can calculate the posterior probabilities for each data point + belonging to each cluster using Bayes' theorem. + - P(z_i = k | x_i) = (π_k * P(x_i | μ_k, σ_k)) / Σ (π_j * P(x_i | μ_j, σ_j)) + - P(z_i = k | x_i) = posterior probability of data point i belonging to cluster k + - π_k = mixing coefficient for component k + - P(x_i | μ_k, σ_k) = probability density of data point x_i under Gaussian component k + - Likelihood Function + - measures how well the GMM fits the data + - calculated as the product of the probabilities of all data points under the model. + - L(θ) = Π P(x_i | θ) = Π Σ (π_k * P(x_i | μ_k, σ_k)) + - L(θ) = likelihood of the model parameters θ given the data + - P(x_i | θ) = probability density of data point x_i under the GMM + - π_k = mixing coefficient for component k + - P(x_i | μ_k, σ_k) = probability density of data point x_i under Gaussian component k + - Expectation-Maximization (EM) Algorithm + - iterative method to find maximum likelihood estimates of parameters in GMMs. 
+ - E-step: + - calculates responsibilities (membership probabilities) for each data point. + - M-step: + - updates parameters (means, variances, mixing coefficients) based on responsibilities. + - Log Likelihood + - logarithm of the likelihood function + - used to evaluate model fit and convergence in GMMs. + - log L(θ) = Σ log P(x_i | θ) = Σ log Σ (π_k * P(x_i | μ_k, σ_k)) + - log L(θ) = log-likelihood of the model parameters θ given the data + - P(x_i | θ) = probability density of data point x_i under the GMM + - π_k = mixing coefficient for component k + - P(x_i | μ_k, σ_k) = probability density of data point x_i under Gaussian component k + - Cluster shapes in GMM + - Mean (μ_k): represents the center of each Gaussian component, indicating the average location + of data points in that cluster. + - Covariance Matrix (Σ_k): captures the shape and orientation of each Gaussian component, + allowing for clusters that can be elongated, circular, or oriented in different directions. + - GMM allows: + - elongated clusters (if covariance matrices have different eigenvalues) + - circular clusters (if covariance matrices are isotropic) + - tilted clusters (if covariance matrices have off-diagonal elements) + - overlapping clusters (if means are close and covariance matrices are large) + - Visualization of GMM Clusters + - contour plots + - can be used to visualize the Gaussian components and + - their corresponding clusters in a 2D feature space. + - each contour represents a level of constant probability density for the Gaussian + distribution, + - helping to illustrate the shape and spread of each cluster. + - Scatter plots + - can also be used to visualize the data points colored by their assigned cluster, + - along with the means of each Gaussian component to show the cluster centers. + - Probability Density Function (PDF) + - describes the likelihood of a continuous random variable taking on a specific value. 
+ - for GMM, the PDF is a weighted sum of Gaussian distributions. + - P(x) = Σ (π_k * P(x | μ_k, σ_k)) + - P(x) = probability density of data point x under the GMM + - π_k = mixing coefficient for component k + - P(x | μ_k, σ_k) = probability density of data point x under Gaussian component k + - PDF tells us: + - Relative Likelihood of values within a given interval + - Shape of the Distribution + - Expected Value (Mean) and Variance + - use cases of PDF + - Rainfall Prediction: modeling the distribution of daily rainfall amounts to estimate the likelihood of different rainfall levels. + - Stock Price Modeling: modeling the distribution of stock returns to assess the probability of various return levels and inform investment decisions. + - Customer Behavior Analysis: modeling the distribution of customer purchase amounts to understand spending patterns and predict future behavior. + - Income Distribution Analysis: modeling the distribution of household incomes to analyze economic inequality and inform policy decisions. + - Important Measures of a Probability Distribution + - Mean of Probability Density Function + - average value of the random variable + - also called the expected value or expectation. + - E[X] = ∫ x * P(x) dx + - E[X] = expected value (mean) of the random variable X + - P(x) = probability density function of X + - Median of Probability Density Function + - value which divides the probability density function graph into two equal halves. + - the point at which the cumulative distribution function (CDF) equals 0.5. + - Median = CDF^(-1)(0.5) + - CDF^(-1) = inverse cumulative distribution function + - Variance of Probability Density Function + - measure of the spread or dispersion of the random variable around its mean.
+ - Var(X) = E[(X - E[X])²] = ∫ (x - E[X])² * P(x) dx + - Var(X) = variance of the random variable X + - E[X] = expected value (mean) of X + - P(x) = probability density function of X + - Standard Deviation of Probability Density Function + - square root of the variance + - provides a measure of the average distance of data points from the mean. + - SD(X) = sqrt(Var(X)) = sqrt(E[(X - E[X])²]) + - SD(X) = standard deviation of the random variable X + - Var(X) = variance of X + - Types of Probability Distributions + - Normal Distribution (Gaussian distribution) + - bell-shaped curve + - symmetric about its mean + - defined by its mean and standard deviation. + - P(x) = (1 / (σ * sqrt(2π))) * exp(- (x - μ)² / (2σ²)) + - P(x) = probability density of data point x + - μ = mean of the distribution + - σ = standard deviation of the distribution + - in the standard normal distribution, mean = 0 and standard deviation = 1 + - so, P(z) = (1 / sqrt(2π)) * exp(- z² / 2) + - P(z) = probability density of data point z under the standard normal distribution + - z = (x - μ) / σ (standardized value of x) + - Uniform Distribution (rectangular distribution) + - all outcomes are equally likely within a specified range. + - f(x) = 1 / (b - a) for a ≤ x ≤ b + - a = lower bound of the distribution + - b = upper bound of the distribution + - Exponential Distribution + - models the time between events in a Poisson process. + - f(x) = λ * exp(-λx) for x ≥ 0, where λ is the event rate + - Binomial Distribution + - models the number of successes in a fixed number of independent Bernoulli trials. + - P(X = k) = C(n, k) * p^k * (1 - p)^(n - k) + - P(X = k) = probability of observing k successes in n trials + - C(n, k) = binomial coefficient (number of ways to choose k successes from n trials) + - p = probability of success on an individual trial + - Chi-Square Distribution + - models the distribution of the sum of squared standard normal variables.
+ - f(x; k) = (1 / (2^(k/2) * Γ(k/2))) * x^(k/2 - 1) * exp(-x/2) + - f(x; k) = probability density at point x under the chi-square distribution with k degrees of freedom + - Γ(k/2) = gamma function evaluated at k/2 + - Joint Probability Density Function + - describes the probability of two or more random variables taking on specific values simultaneously. + - for GMM, the joint PDF can be calculated by multiplying the PDFs of the individual Gaussian components and their mixing coefficients (this assumes x and y are independent within each component, i.e., a diagonal covariance). + - P(x, y) = Σ (π_k * P(x | μ_k, σ_k) * P(y | μ_k, σ_k)) + - P(x, y) = joint probability density of data points x and y under the GMM + - π_k = mixing coefficient for component k + - P(x | μ_k, σ_k) = probability density of data point x under Gaussian component k + - P(y | μ_k, σ_k) = probability density of data point y under Gaussian component k + - Kernel Density Estimation (KDE) + - non-parametric method to estimate the probability density function of a random variable. + - Unlike histograms, which use discrete bins, + - KDE provides a smooth and continuous estimate of the underlying distribution, + - making it particularly useful when dealing with continuous data. + - smooths data points using a kernel function to create a continuous density estimate. + - useful for visualizing the distribution of data and identifying underlying patterns without assuming a specific parametric form.
+ - f(x) = (1 / (n * h)) * Σ K((x - x_i) / h) + - f(x) = estimated probability density function at point x + - n = number of data points + - h = bandwidth (smoothing parameter) + - K = kernel function (e.g., Gaussian kernel) + - x_i = individual data points + - Kernel Functions + - Gaussian Kernel: K(u) = (1 / sqrt(2π)) * exp(-u² / 2) + - Epanechnikov Kernel: K(u) = 0.75 * (1 - u²) for |u| ≤ 1, 0 otherwise + - Uniform Kernel: K(u) = 0.5 for |u| ≤ 1, 0 otherwise + - Triangular Kernel: K(u) = (1 - |u|) for |u| ≤ 1, 0 otherwise + - Quartic (Biweight) Kernel: K(u) = (15/16) * (1 - u²)² for |u| ≤ 1, 0 otherwise + - bandwidth selection + - critical parameter that controls the smoothness of the density estimate + - too small bandwidth leads to overfitting (too wiggly), while too large bandwidth leads + to underfitting (too smooth). + - methods for selecting bandwidth include Silverman's rule of thumb, cross-validation, + and plug-in methods. + - Optimal Bandwidth Selection + - Silverman's Rule of Thumb: h = 1.06 * σ * n^(-1/5) + - h = optimal bandwidth + - σ = standard deviation of the data + - n = number of data points + - Cross-Validation: selecting the bandwidth that minimizes the mean integrated squared + error (MISE) through techniques like leave-one-out cross-validation. + - Plug-in Methods: estimating the optimal bandwidth by plugging in estimates of the + underlying density's derivatives. + - Multivariate KDE + - extends KDE to estimate the joint probability density function of multiple random + variables. 
+ - f(x) = (1 / (n * h^d)) * Σ K((x - x_i) / h) + - f(x) = estimated joint probability density function at point x + - n = number of data points + - h = bandwidth (smoothing parameter) + - d = number of dimensions (random variables) + - K = multivariate kernel function (e.g., multivariate Gaussian kernel) + - x_i = individual data points in d-dimensional space + - Variants of KDE + - Adaptive KDE: adjusts the bandwidth locally based on the density of data points, allowing for better estimation in regions with varying data density. + - Boundary-Corrected KDE: modifies the kernel function near the boundaries of the data to reduce bias in density estimation. + - Fast KDE: utilizes efficient algorithms and data structures (e.g., KD-trees) to speed up the computation of KDE for large datasets. + - Cumulative Distribution Function (CDF) + - describes the probability that a continuous random variable takes on a value less than or equal to a specific value. + - for GMM, the CDF can be calculated by integrating the PDF over the desired range. + - CDF(x) = Σ (π_k * CDF_k(x)) + - CDF(x) = cumulative distribution function of data point x under the GMM + - π_k = mixing coefficient for component k + - CDF_k(x) = cumulative distribution function of data point x under Gaussian component k + - PDF is obtained by differentiating the Cumulative Distribution Function (CDF), and the CDF can be obtained by integrating the PDF + - Dirichlet Process Mixture Models (DPMMs) + - automatically decide the number of clusters based on the data. + - non-parametric Bayesian approach that allows for an infinite number of potential clusters, with the actual number determined by the data. + - uses a Dirichlet process as a prior distribution over the mixture components, enabling flexible modeling of complex data distributions without needing to specify the number of clusters in advance.
+ - working + - data points are assigned to clusters based on a Chinese Restaurant Process (CRP) or + stick-breaking construction, which allows for the creation of new clusters as needed. + - the model iteratively updates cluster assignments and parameters using Markov Chain Monte + Carlo (MCMC) sampling or variational inference methods to find the best fit for the data. + - the number of clusters can grow as more data points are observed, allowing for a flexible + and adaptive clustering solution that can capture the underlying structure of the data + without overfitting or underfitting. + - the Dirichlet process is parameterized by a concentration parameter (α) that controls the + likelihood of creating new clusters, with higher values of α leading to more clusters and + lower values leading to fewer clusters. + - the model can be used for various applications, including clustering, density estimation, + and topic modeling, where the number of underlying groups or topics is unknown and needs + to be inferred from the data. + - DPMMs are particularly useful in scenarios where the data may have a complex structure + with an unknown number of clusters, such as in natural language processing, image + analysis, and bioinformatics. + - Concepts + - Chinese Restaurant Process (CRP): a metaphor for how data points are assigned to clusters + in a DPMM, where customers (data points) choose tables (clusters) based on the number of + customers already at the table and a concentration parameter. + - Stick-breaking Construction: a method for constructing the Dirichlet process by breaking a + stick into pieces that represent the weights of the mixture components, allowing for an + infinite number of potential clusters. + - Beta Distribution: a family of continuous probability distributions that can be used as a + prior for the mixing proportions in a DPMM, allowing for flexible modeling of cluster + weights. 
+ - f(x; α, β) = (1 / B(α, β)) * x^(α - 1) * (1 - x)^(β - 1) + - f(x; α, β) = probability density function of the Beta distribution with parameters + α and β + - B(α, β) = beta function evaluated at α and β + - x = value between 0 and 1 representing the mixing proportion for a cluster + - Dirichlet distribution: a family of continuous multivariate probability distributions that + can be used as a prior for + the cluster assignments in a DPMM, allowing for flexible modeling of cluster membership + probabilities. + - f(x_1, ..., x_k; α_1, ..., α_k) = (1 / B(α)) * Π x_i^(α_i - 1) + - f(x_1, ..., x_k; α_1, ..., α_k) = probability density function of the Dirichlet + distribution with parameters α_1, ..., α_k + - B(α) = multivariate beta function evaluated at α_1, ..., α_k + - x_i = value between 0 and 1 representing the probability of assignment to cluster + i + - Dirichlet Process: a stochastic process used in Bayesian non-parametric models, where the + number of clusters can grow as more data is observed, allowing for flexible modeling of + complex data distributions without needing to specify the number of clusters in advance. + - DP(α, G_0) + - α = concentration parameter that controls the likelihood of creating new clusters + - G_0 = base distribution that defines the prior distribution over cluster + parameters + - Stick Breaking Process: a constructive definition of the Dirichlet process where a stick + of unit length is broken into pieces to represent the weights of the mixture components, + allowing for an infinite number of potential clusters. 
+ - V_k ~ Beta(1, α) + - π_k = V_k * Π (1 - V_j) for j < k + - V_k = random variable drawn from a Beta distribution representing the proportion of the remaining stick allocated to cluster k + - π_k = weight of cluster k in the mixture model, calculated based on the stick-breaking process + - Fuzzy Clustering + - extends traditional methods by allowing each data point to belong to multiple clusters + - with varying degrees of membership. + - captures ambiguity and soft boundaries in data + - particularly useful when the clusters overlap or boundaries are not clear-cut. + - Algos + - Fuzzy C-Means: + - Similar to K-Means but with fuzzy memberships updated iteratively. + - Association Rule Mining (Association Rules) + - discover interesting relationships or associations between variables in large datasets. + - Market Basket Analysis + - often used in market basket analysis to find products that are frequently bought together. + - help retailers and analysts understand customer behavior by discovering item associations in transaction data. + - e.g. if customers often buy bread and butter together, + - the rule {bread} → {butter} might be discovered, + - indicating that the presence of bread in a transaction increases the likelihood of butter being purchased as well. + - Data Mining + - used in various data mining applications to uncover hidden patterns and relationships in large datasets, such as customer segmentation, web usage mining, bioinformatics, and fraud detection. + - Key Concepts + - Support: measures how frequently the items appear together in the dataset. + - Support(A → B) = P(A ∩ B) = (Number of transactions containing both A and B) / (Total number of transactions) + - Confidence: measures how often item B is purchased when item A is purchased.
+        - Confidence(A → B) = P(B|A) = (Number of transactions containing both A and B) / (Number of transactions containing A)
+      - Lift: measures how much more likely item B is purchased when item A is purchased, compared to when item A is not purchased.
+        - Lift(A → B) = P(B|A) / P(B) = Confidence(A → B) / Support(B)
+    - Algorithms
+      - Apriori Algorithm
+        - generates candidate itemsets and prunes those that do not meet the minimum support threshold.
+      - FP-Growth (Frequent Pattern Growth)
+        - constructs a compact data structure called an FP-tree
+        - to efficiently mine frequent itemsets without candidate generation.
+      - Eclat (Equivalence Class Clustering and bottom-up Lattice Traversal)
+        - uses a depth-first search strategy to find frequent itemsets by intersecting transaction lists.
+    - Key Components (X → Y)
+      - Antecedent (X): the item or itemset that appears on the left side of the rule (the "if" part).
+      - Consequent (Y): the item or itemset that appears on the right side of the rule (the "then" part).
+      - Rule: an implication of the form X → Y, where X is the antecedent and Y is the consequent.
+    - Rule Evaluation Metrics
+      - Support:
+        - measures how frequently both itemsets X and Y appear together in the dataset.
+        - support(X → Y) = P(X ∩ Y) = (Number of transactions containing both X and Y) / (Total number of transactions)
+      - Confidence:
+        - measures how often itemset Y is purchased when itemset X is purchased.
+        - confidence(X → Y) = P(Y|X) = (Number of transactions containing both X and Y) / (Number of transactions containing X)
+      - Lift:
+        - measures how much more likely itemset Y is purchased when itemset X is purchased, compared to when X is not purchased.
+ - lift(X → Y) = P(Y|X) / P(Y) = confidence(X → Y) / support(Y) + - lift > 1 indicates a positive association between X and Y + - lift < 1 indicates a negative association between X and Y + - lift = 1 indicates no association between X and Y + - Applications + - Market Basket Analysis: discovering products frequently bought together to optimize store layouts and + promotions. + - Customer Segmentation: identifying groups of customers with similar purchasing behaviors. + - Web Usage Mining: analyzing user behavior on websites to improve navigation and content placement. + - Bioinformatics: discovering associations between genes, proteins, or diseases. + - Fraud Detection: identifying patterns of fraudulent transactions by analyzing associations between + transaction attributes. + - Dimensionality Reduction + - reduce the number of features while retaining key information. + - helps improve model performance, reduce overfitting and enhance interpretability. + - converts high-dimensional data into a lower-dimensional space while preserving important details. + - Techniques + - Feature Selection + - selects a subset of the original features based on their importance or relevance to the target + variable. + - methods include + - filter methods (e.g., correlation, mutual information) + - wrapper methods (e.g., recursive feature elimination) + - embedded methods (e.g., Lasso regression). + - Feature Extraction + - creates new features by transforming the original features into a lower-dimensional space. + - methods include: + - Principal Component Analysis (PCA) + - dimensionality reduction technique + - reduce the number of features in a dataset + - by transforming the original features into a new set of uncorrelated variables called + principal components. + - Principal Components + - linear combinations of the original features that capture the maximum variance in the + data. 
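The support/confidence/lift definitions under Association Rule Mining above can be checked on a tiny made-up transaction list (item names and counts are illustrative only):

```python
# Toy transactions (hypothetical) to illustrate support, confidence and lift
# for the rule {bread} -> {butter}.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"butter", "milk"},
]

n = len(transactions)
n_bread = sum("bread" in t for t in transactions)              # 3
n_butter = sum("butter" in t for t in transactions)            # 3
n_both = sum({"bread", "butter"} <= t for t in transactions)   # 2

support = n_both / n                 # P(bread ∩ butter) = 2/5 = 0.4
confidence = n_both / n_bread        # P(butter | bread) = 2/3
lift = confidence / (n_butter / n)   # (2/3) / (3/5) ≈ 1.11 → positive association
```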
+            - the first principal component captures the most variance, followed by the second, and so on.
+          - uses linear algebra to transform data into new features called principal components.
+          - Principal Component Analysis (PCA) Steps
+            - Standardize the data
+              - normalize the data to have a mean of zero and a standard deviation of one
+              - Z = (X - μ) / σ
+                - X = original feature value
+                - μ = mean of the feature
+                - σ = standard deviation of the feature
+            - Compute the covariance matrix
+              - cov(x1, x2) = Σ_{i=1}^{n} (x1_i - μ1)(x2_i - μ2) / (n - 1)
+                - x1 and x2 are two features
+                - μ1 and μ2 are the means of those features
+            - Find the principal components
+              - PC1 = direction of maximum variance (most spread).
+                - PC1 = w1 * x1 + w2 * x2 + ... + wp * xp
+                  - w1, w2, ..., wp = weights (eigenvectors) for each original feature
+                  - x1, x2, ..., xp = original features
+              - PC2 = next best direction, perpendicular to PC1, and so on.
+                - PC2 = w1 * x1 + w2 * x2 + ... + wp * xp (with its own weight vector)
+            - Calculate eigenvalues (importance) and eigenvectors (directions of maximum variance)
+              - AX = λX
+                - A = covariance matrix
+                - X = eigenvector (direction of maximum variance)
+                - λ = eigenvalue (amount of variance captured by that direction)
+              - i.e.
+                - when A acts on X, it only stretches or shrinks X by the scalar λ.
+                - the direction of X remains unchanged, hence eigenvectors define the "stable directions" of A.
+            - Sort eigenvalues and select the top k eigenvectors
+              - PCA ranks eigenvalues and eigenvectors by the amount of information they capture
+              - select the top k eigenvectors corresponding to the largest eigenvalues to form the new feature space.
+            - Transform the original data to the new feature space using the selected eigenvectors.
+              - new_data = original_data * selected_eigenvectors
+        - Singular Value Decomposition (SVD)
+          - factorizes a matrix into three matrices (U, Σ, V^T)
+          - used for dimensionality reduction, noise reduction and data compression.
+        - Missing Value Ratio
+          - the proportion of entries in a feature that are left blank or marked as NaN, None or special strings like "Unknown"
+          - missing values can lead to biased models and inaccurate predictions if not handled properly.
+          - features with a high ratio of missing values may be less informative and can be considered for removal during feature selection.
+          - Types of Missing Data
+            - Missing Completely at Random (MCAR)
+              - missingness is unrelated to any observed or unobserved data.
+            - Missing at Random (MAR)
+              - missingness is related to observed data but not to the missing data itself.
+            - Missing Not at Random (MNAR)
+              - missingness is related to the missing data itself.
+          - Methods to Handle Missing Data
+            - Deletion Methods
+              - Listwise Deletion:
+                - removes entire rows with missing values.
+              - Pairwise Deletion:
+                - uses all available data for each analysis,
+                - ignoring missing values on a case-by-case basis.
+            - Imputation Methods
+              - Mean/Median/Mode Imputation:
+                - fills missing values with the mean, median or mode of the feature.
+              - K-Nearest Neighbors (KNN) Imputation:
+                - fills missing values based on the values of the nearest neighbors.
+              - Multiple Imputation:
+                - creates multiple imputed datasets and combines results to account for uncertainty in imputation.
+              - Regression Imputation:
+                - predicts missing values using a regression model based on other features.
+        - Backward Feature Selection/Elimination
+          - starts with all features and iteratively removes the least important ones based on model performance.
+          - example:
+            - Predicting Sales Based on Advertising Budget
+        - Forward Feature Selection
+          - starts with no features and iteratively adds the most important ones based on model performance.
+        - Factor Analysis
+          - identifies hidden patterns or underlying relationships among a large set of variables.
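The PCA steps listed above (standardize, covariance, eigen-decomposition, sort, project) can be sketched directly in NumPy; the random toy data and the choice k = 2 are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))              # toy data: 100 samples, 3 features

# 1. Standardize each feature: Z = (X - μ) / σ
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized data
C = np.cov(Z, rowvar=False)

# 3. Eigenvalues (variance captured) and eigenvectors (directions);
#    eigh is appropriate because C is symmetric.
eigvals, eigvecs = np.linalg.eigh(C)

# 4. Sort by decreasing eigenvalue and keep the top k eigenvectors
k = 2
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order[:k]]

# 5. Project the data onto the new feature space
new_data = Z @ W                           # shape (100, 2)
```

The projected components are uncorrelated by construction, which is easy to verify from their covariance matrix.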
+ - Features of factor analysis + - Data Reduction + - reduces the number of observed variables into a smaller set of latent factors. + - Latent Variables (Factors) + - unobserved variables that explain the correlations among observed variables. + - Correlation based + - group observed variables that are highly correlated with each other into factors. + - Variance Explained + - shared variance among observed variables that is captured by the latent factors. + - Exploratory and Confirmatory + - exploratory factor analysis (EFA) + - used to discover the underlying factor structure without predefined hypotheses + - identifies the number of factors and the loadings of observed variables on those + factors. + - uses techniques like eigenvalues and scree plots to determine the number of + factors to retain. + - scree plot is a graphical representation of the eigenvalues associated with each + factor, used to determine the + number of factors to retain in exploratory factor analysis (EFA). + - the plot typically shows a steep decline in eigenvalues followed by a leveling + off, + - and the point at which the decline levels off (the "elbow") is often used as a + criterion + - for selecting the number of factors to retain. + - confirmatory factor analysis (CFA) + - tests specific hypotheses about the factor structure + - used to confirm whether the data fits a predefined factor model based on + theoretical expectations. + - requires specifying the number of factors and the expected relationships between + observed variables and latent factors in advance. + - uses fit indices (e.g., Chi-square, RMSEA, CFI) to evaluate how well the specified + model fits the observed data. 
+                - fit indices provide quantitative measures of how well the specified factor model matches the observed data, helping researchers assess the adequacy of their model and make informed decisions about model modifications or acceptance
+            - Factor Loadings
+              - coefficients that represent the relationship between observed variables and latent factors.
+            - Eigenvalues
+              - measure the amount of variance in the observed variables explained by each factor.
+            - Factor Rotation for clarity
+              - technique used to improve interpretability of factors by making loadings more distinct.
+              - methods include
+                - Varimax (orthogonal) rotation
+                  - simplifies the factor structure by maximizing the variance of squared loadings for each factor.
+                  - results in factors that are uncorrelated with each other.
+                - Promax (oblique) rotation
+                  - permits factors to be correlated,
+                  - allowing for a more realistic representation of relationships among factors.
+          - Types of Factor Extraction Methods
+            - Principal Component Analysis (PCA)
+              - focuses on maximizing the variance explained by the factors, rather than modeling the underlying structure of the data.
+            - Canonical Factor Analysis (CFA)
+              - also known as Rao's Canonical Factor Analysis, it identifies factors that are linear combinations of the observed variables, maximizing the correlation between the factors and the original variables.
+              - differs from PCA in that it seeks factors that are most strongly correlated with the original variables, rather than simply maximizing variance.
+            - Common Factor Analysis (CFA)
+              - focuses on shared variance among observed variables to identify underlying factors.
+              - also known as Principal Axis Factoring (PAF), it identifies factors that explain the common variance among observed variables, rather than the total variance as in PCA.
+            - Principal Axis Factoring (PAF)
+              - an implementation of common factor analysis that extracts factors from the shared variance among observed variables.
+            - Maximum Likelihood Factor Analysis
+              - estimates factor loadings by maximizing the likelihood of the observed data given the model.
+            - Alpha Factoring
+              - based on the reliability of the factors, aiming to maximize internal consistency.
+            - Image Factoring
+              - uses the squared multiple correlations of observed variables as initial communality estimates.
+            - Unweighted Least Squares (ULS)
+              - minimizes the sum of squared differences between the observed and reproduced correlation matrices without assuming multivariate normality.
+          - Working of Factor Analysis
+            - Determine suitability of the data for factor analysis
+              - Bartlett's Test of Sphericity
+                - tests whether the correlation matrix is an identity matrix, which would indicate that the variables are unrelated and unsuitable for factor analysis.
+              - Kaiser-Meyer-Olkin (KMO) Measure of Sampling Adequacy
+                - assesses the proportion of variance among variables that might be common variance, with values closer to 1 indicating suitability for factor analysis.
+                - a value of 0.6 or above is generally considered acceptable for factor analysis.
+            - Choose the extraction method
+              - e.g. PCA, Canonical Factor Analysis, Common Factor Analysis/Principal Axis Factoring, Maximum Likelihood, Alpha Factoring, Image Factoring or Unweighted Least Squares (see the extraction methods above).
+            - Factor Extraction
+              - extract factors based on the chosen method, resulting in factor loadings and eigenvalues.
+            - Determine the number of factors to retain
+              - use criteria such as eigenvalues greater than 1, scree plot analysis, or parallel analysis to decide how many factors to keep.
+            - Factor Rotation
+              - apply rotation techniques (e.g., Varimax, Promax) to improve the interpretability of the factor solution by making loadings more distinct.
+            - Interpret the factors
+              - analyze the factor loadings to understand the underlying structure of the data and the relationships between observed variables and latent factors.
+            - Compute factor scores (optional)
+              - calculate factor scores for each observation, which can be used in subsequent analyses or modeling.
+            - Report and validate results
+              - present the factor analysis results, including the factor loadings, eigenvalues, and fit indices, and validate the findings through cross-validation or replication studies.
+        - Independent Component Analysis (ICA)
+          - used to separate mixed signals into their independent, non-Gaussian components.
+          - finds a linear transformation of the data that maximizes statistical independence among the components.
+          - Statistical Independence
+            - components are independent if the joint probability distribution of the components factorizes into the product of their individual distributions.
+            - P(X and Y) = P(X) * P(Y), i.e. P(X ∩ Y) = P(X)P(Y)
+          - Assumptions of ICA
+            - Linearity: ICA assumes that the observed data is a linear mixture of independent components, meaning that the observed signals are generated by linearly combining the independent sources.
+            - Statistical Independence: ICA assumes that the underlying components are statistically independent of each other, meaning that the occurrence of one component does not provide any information about the occurrence of another component.
+            - Non-Gaussianity: ICA relies on the assumption that the independent components are non-Gaussian, as Gaussian variables cannot be separated using ICA due to their symmetric properties.
+          - e.g.
Cocktail Party Problem + - multiple people talking simultaneously in a noisy environment + - ICA can separate the mixed audio signals into individual voices, allowing us to focus on a + specific speaker. + - Applications include: + - Blind Source Separation: separating mixed audio signals into individual sources (e.g., + separating voices in a recording). + - Image Processing: separating overlapping images or extracting features from images. + - Biomedical Signal Processing: analyzing EEG or fMRI data to identify independent brain + activity patterns. + - Financial Data Analysis: identifying independent factors that influence financial markets. + - Natural Language Processing: separating mixed topics in text data. + - Algorithms include: + - FastICA: an efficient algorithm for performing ICA using a fixed-point iteration scheme. + - Infomax: an algorithm that maximizes the mutual information between the input and output + signals to achieve independence. + - JADE (Joint Approximate Diagonalization of Eigenmatrices): an algorithm that uses + higher-order statistics to achieve independence. + - Kernel ICA: an extension of ICA that uses kernel methods to capture nonlinear + relationships between components. + - Steps in ICA: + - Centering: subtract the mean from the data to center it around the origin. + - Whitening: transform the data to have unit variance and be uncorrelated. + - Estimation of Independent Components: apply an algorithm (e.g., FastICA) to estimate the + independent components from the whitened data. + - Post-processing: optionally, scale or order the independent components for + interpretability or further analysis. + - Challenges in ICA: + - Identifiability: ICA can only recover independent components up to a scaling and + permutation ambiguity, meaning that the order and scale of the components cannot be + determined. 
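The first two ICA steps listed above (centering and whitening) can be sketched in NumPy on made-up mixtures; the actual independence-estimation step (e.g. FastICA) would then run on the whitened data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up non-Gaussian source signals, linearly mixed into two observations.
t = np.linspace(0, 20, 1000)
S = np.vstack([np.sign(np.sin(3 * t)),          # square wave
               rng.uniform(-1, 1, 1000)])       # uniform noise
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])                      # hypothetical mixing matrix
X = A @ S                                       # observed mixtures, shape (2, 1000)

# Step 1: centering — subtract the mean of each mixture.
Xc = X - X.mean(axis=1, keepdims=True)

# Step 2: whitening — decorrelate and scale to unit variance
# via an eigen-decomposition of the covariance matrix.
d, E = np.linalg.eigh(np.cov(Xc))
Xw = E @ np.diag(d ** -0.5) @ E.T @ Xc          # cov(Xw) ≈ identity
```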
+ - Non-Gaussianity: ICA relies on the assumption that the independent components are + non-Gaussian, which may not always hold in practice. + - Computational Complexity: ICA can be computationally intensive, especially for large + datasets or high-dimensional data. + - Evaluation of ICA results: + - Assessing the independence of the estimated components using measures such as mutual + information or correlation. + - Comparing the estimated components to known sources (if available) to evaluate the + accuracy of the separation. + - Visualizing the components to interpret their meaning and relevance to the original data. + - Linear Discriminant Analysis + - Independent Component Analysis + - Non-negative Matrix Factorization + - Multidimensional Scaling + - Isomap + - Locally Linear Embedding + - Self-Organizing Maps + - Uniform Manifold Approximation and Projection (UMAP) + - use cases + - Text Categorization + - classify data into predefined categories based on content, such as spam detection, sentiment + analysis, topic classification etc. + - Image Recognition + - identify and classify objects or patterns in images, such as facial recognition, object + detection, medical imaging analysis etc. + - Dimensionality Reduction + - reduce the number of features while retaining key information, improving model performance and + interpretability. + - e.g. PCA can be used to reduce the dimensionality of a dataset with many features while + preserving as much variance as possible, making it easier to visualize and analyze the data. + - Gene Expression Analysis + - analyze gene expression data to identify patterns and relationships between genes, such as + clustering genes with similar expression profiles or identifying key factors that influence + gene expression. + - Customer Segmentation + - group customers based on their characteristics and behaviors to target marketing efforts and + improve customer satisfaction. 
+      - Anomaly Detection
+        - identify unusual patterns or outliers in data that may indicate fraud, network intrusions, or other abnormal behavior.
+      - Recommender Systems
+        - provide personalized recommendations to users based on their preferences and behavior, such as recommending movies, products, or content.
+      - Natural Language Processing
+        - analyze and understand human language, such as sentiment analysis, topic modeling, and text classification.
+      - Time Series Analysis
+        - analyze and forecast time-dependent data, such as stock prices, weather patterns, or sensor data.
+      - Bioinformatics
+        - analyze biological data, such as gene expression, protein interactions, or evolutionary relationships.
+      - Social Network Analysis
+        - analyze social networks to understand relationships and influence among individuals or groups.
+      - Image Compression
+      - Data Visualization
+    - t-Distributed Stochastic Neighbor Embedding (t-SNE)
+      - non-linear dimensionality reduction technique
+      - used for visualizing high-dimensional data in a lower-dimensional space (typically 2D or 3D).
+      - preserves the local structure of the data, making it effective for visualizing clusters and patterns in complex datasets.
+      - Steps in t-SNE:
+        - Compute pairwise similarities in the high-dimensional space using a Gaussian distribution.
+        - Compute pairwise similarities in the low-dimensional space using a Student's t-distribution.
+        - Minimize the Kullback-Leibler divergence between the two distributions using gradient descent.
+      - Applications of t-SNE:
+        - Visualizing high-dimensional data such as gene expression profiles, image features, or word embeddings.
+        - Clustering analysis to identify groups of similar data points in a lower-dimensional space.
+        - Analyzing the structure of complex datasets, such as social networks or customer behavior data.
+        - Evaluating the performance of dimensionality reduction techniques by visualizing the resulting embeddings.
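A minimal t-SNE usage sketch with scikit-learn (assuming `sklearn` is available; the two-blob toy data and the perplexity value are arbitrary choices):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Toy high-dimensional data: two well-separated blobs (made up).
X = np.vstack([rng.normal(0, 1, size=(30, 10)),
               rng.normal(8, 1, size=(30, 10))])

# Perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=10.0, init="random", random_state=0)
emb = tsne.fit_transform(X)   # (60, 2) array, ready for a 2-D scatter plot
```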
+ - Non-Negative Matrix Factorization (NMF) + - break down large dataset into smaller meaningful parts while ensuring that all values remain + non-negative + - factorizes a non-negative matrix into two lower-rank non-negative matrices, + - A ≈ W * H + - A = original non-negative matrix (e.g., data matrix) + - W = basis matrix (non-negative) + - H = coefficient matrix (non-negative) + - used for dimensionality reduction and feature extraction. + - Steps in NMF: + - Initialize the factor matrices with non-negative values. + - Update the factor matrices iteratively using multiplicative update rules to minimize the + reconstruction error. + - Continue updating until convergence or a specified number of iterations is reached. + - Optimization in NMF: + - Multiplicative Update Rules + - ensure that the factor matrices remain non-negative throughout the optimization process. + - Alternating Least Squares (ALS) + - alternately fix one factor matrix and solve for the other using least squares optimization. + - The objective function to minimize is often the Frobenius norm + - of the difference between the original matrix and the product of the factor + - matrices, i.e., ||A - W * H||_F^2. + - Applications include + - topic modeling in text analysis, + - image processing, and + - collaborative filtering in recommender systems. + - Advantages of NMF: + - Interpretability: the non-negativity constraint leads to parts-based representations, making it + easier to interpret the resulting factors. + - Sparsity: NMF often produces sparse representations, which can be beneficial for feature selection + and reducing overfitting. + - Flexibility: NMF can be applied to various types of data, including text, images, and + collaborative filtering data. + - Challenges of NMF: + - Local Minima: NMF optimization can get stuck in local minima, leading to suboptimal factorization + results. 
+ - Sensitivity to Initialization: the choice of initial values for the factor matrices can + significantly impact the convergence and quality of the factorization. + - Computational Complexity: NMF can be computationally intensive, especially for large datasets or + high-dimensional data. + - Variants of NMF: + - Sparse NMF: incorporates sparsity constraints to encourage sparse representations in the factor + matrices. + - Non-negative Tensor Factorization: extends NMF to higher-order tensors for analyzing + multi-dimensional data. + - Convolutional NMF: applies NMF to convolutional data, such as images or time series, to capture + local patterns and features. + - Evaluation of NMF results: + - Reconstruction Error: measures the difference between the original matrix and the product of the + factor matrices, with lower values indicating better factorization. + - Explained Variance: assesses how much of the variance in the original data is captured by the + factorization, with higher values indicating better performance. + - Interpretability: evaluates how well the resulting factors can be interpreted in the context of + the data and the specific application. + - Isomap + - non-linear dimensionality reduction technique + - preserves geodesic distances between data points in the high-dimensional space when mapping to a + lower-dimensional space. + - Steps in Isomap: + - Construct a neighborhood graph based on pairwise distances between data points. + - Compute the shortest path (geodesic distance) between all pairs of data points in the graph. + - Apply classical Multidimensional Scaling (MDS) to the matrix of geodesic distances to obtain the + low-dimensional embedding. + - Working of Isomap: + - Neighborhood Graph Construction + - connect each data point to its k nearest neighbors or all points within a certain radius. 
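The multiplicative-update optimization described for NMF above can be sketched directly in NumPy (Lee–Seung-style updates; the random data, rank 3, and iteration count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((20, 10))          # made-up non-negative data matrix
k = 3                             # target rank (arbitrary)

# Non-negative initialization of the factor matrices.
W = rng.random((20, k))
H = rng.random((k, 10))

# Multiplicative updates minimizing ||A - W·H||_F^2; the update form
# keeps W and H non-negative throughout (small ε avoids division by zero).
eps = 1e-10
for _ in range(200):
    H *= (W.T @ A) / (W.T @ W @ H + eps)
    W *= (A @ H.T) / (W @ H @ H.T + eps)

err = np.linalg.norm(A - W @ H)   # reconstruction error (Frobenius norm)
```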
+ - Geodesic Distance Calculation + - use algorithms like Dijkstra's or Floyd-Warshall to compute the shortest path between all pairs + of data points in the neighborhood graph, resulting in a matrix of geodesic distances. + - Dimensionality Reduction + - apply classical MDS to the geodesic distance matrix to find a low-dimensional embedding that + preserves the geodesic distances as much as possible. + - Manifold Learning + - techniques that seek to uncover the underlying structure of high-dimensional data by assuming + that the data lies on a lower-dimensional manifold embedded in the high-dimensional space. + - Example + - A 2D sheet of paper twisted into a 3D spiral + - Methods include: + - Isomap + - Locally Linear Embedding (LLE) + - Self-Organizing Maps (SOM) + - Uniform Manifold Approximation and Projection (UMAP) + - Applications include: + - Visualizing high-dimensional data such as gene expression profiles, image features, or word + embeddings. + - Clustering analysis to identify groups of similar data points in a lower-dimensional space. + - Analyzing the structure of complex datasets, such as social networks or customer behavior data. + - Evaluating the performance of dimensionality reduction techniques by visualizing the resulting + embeddings. + - LLE(Locally Linear Embedding) + - Example dataset + - Swiss roll reduction + - non-linear dimensionality reduction technique + - manifold learning method that seeks to uncover the underlying structure of high-dimensional data by assuming + that the data lies on a lower-dimensional manifold embedded in the high-dimensional space. + - assumes each data point and its neighbors lie on a locally linear patch of the manifold + - preserves local relationships between data points in the high-dimensional space when mapping to a + lower-dimensional space. + - Steps in LLE: + - Construct a neighborhood graph based on pairwise distances between data points. 
+      - Compute the weights that best reconstruct each data point from its neighbors in the high-dimensional space.
+      - Compute the low-dimensional embedding by minimizing the reconstruction error using the computed weights.
+    - Applications include:
+      - Visualizing high-dimensional data such as gene expression profiles, image features, or word embeddings.
+      - Clustering analysis to identify groups of similar data points in a lower-dimensional space.
+      - Analyzing the structure of complex datasets, such as social networks or customer behavior data.
+      - Evaluating the performance of dimensionality reduction techniques by visualizing the resulting embeddings.
+  - Autoencoders
+    - neural networks trained to reconstruct their input through a low-dimensional bottleneck; the bottleneck activations act as a compressed representation, so they can be used for non-linear dimensionality reduction.
+    - e.g.
+      - house price prediction
+        - useful features: size, location, number of bedrooms, age of the house etc.
+        - too many features, like number of windows, color of the walls, type of flooring etc.,
+        - may lead to overfitting and poor generalization to new data.
+      - customer churn prediction
+        - features like customer demographics, usage patterns, customer service interactions etc.
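The Swiss-roll example mentioned under LLE above can be run with scikit-learn's `LocallyLinearEmbedding` (assuming `sklearn` is available; the sample size and neighbor count are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# 3-D Swiss roll: points lie on a rolled-up 2-D sheet.
X, color = make_swiss_roll(n_samples=500, random_state=0)

# "Unroll" it into 2 dimensions using locally linear reconstructions.
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2, random_state=0)
X2 = lle.fit_transform(X)     # (500, 2) embedding
```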
+      - customer segmentation
+      - image recognition
+      - sentiment analysis
+      - fraud detection
 - Reinforcement Learning
-  - Q-Learning
-  - Deep Q-Networks (DQN)
-  - Policy Gradient Methods
+  - Q-Learning
+  - Deep Q-Networks (DQN)
+  - Policy Gradient Methods
 - Acronyms
-  - ML: Machine Learning
-  - AI: Artificial Intelligence
-  - CNN: Convolutional Neural Networks
-  - RNN: Recurrent Neural Networks
-  - SVM: Support Vector Machines
-  - AWS: Amazon Web Services
-  - NLP: Natural Language Processing
-  - EDA: Exploratory Data Analysis
-  - ROC-AUC: Receiver Operating Characteristic - Area Under the Curve
- 
\ No newline at end of file
+  - ML: Machine Learning
+  - AI: Artificial Intelligence
+  - CNN: Convolutional Neural Networks
+  - RNN: Recurrent Neural Networks
+  - SVM: Support Vector Machines
+  - AWS: Amazon Web Services
+  - NLP: Natural Language Processing
+  - EDA: Exploratory Data Analysis
+  - ROC-AUC: Receiver Operating Characteristic - Area Under the Curve
+- Data Quality
+  - GIGO: Garbage In, Garbage Out
+  - enrich the dataset
+    - combine data from diverse sources and catalog it
+  - Metrics to measure data quality
+    - Ratio of data to errors
+    - Number of empty values
+    - Data transformation error rates
+    - Amount of dark data
+      - data that is collected but never used
+- Pickle File
+  - a serialized (binary) version of a Python object, such as a trained model
+  - can be deserialized (unpickled) for later use
+  - used to:
+    - save feature-engineered data in binary format
+      - and reload it during model training
+    - save the final ML model as a pickle file after achieving satisfactory accuracy
+  - maintain versioning of pickle files
+- ML lifecycle
+  - Pipeline Development
+    - Gathering Data
+      - transforming raw data into clean data
+    - Data Preparation
+      - Data Wrangling
+    - Analyze Data
+  - Training Phase
+    - Train Model
+    - Test Model
+  - Inference Phase
+    - Deployment
+    - Retraining the model
+- read:
+  - https://www.geeksforgeeks.org/maths/central-limit-theorem/
+  - https://www.geeksforgeeks.org/engineering-mathematics/continuous-random-variables/
+  - https://www.geeksforgeeks.org/maths/poisson-distribution/
+  - https://www.geeksforgeeks.org/maths/what-is-binomial-probability-distribution-with-example/
+  - https://www.geeksforgeeks.org/data-science/normal-distribution-in-business-statistics/
diff --git a/Statistics/readme.md b/Statistics/readme.md
new file mode 100644
index 0000000..c9357bb
--- /dev/null
+++ b/Statistics/readme.md
@@ -0,0 +1,35 @@
+- Statistics: https://www.geeksforgeeks.org/maths/statistics/
+  - Descriptive Statistics
+    - Measures of Central Tendency
+      - Mean
+      - Median
+      - Mode
+    - Measures of Variability/Dispersion
+      - Range
+      - Variance
+      - Standard Deviation
+    - Graphical Representations
+      - Histograms
+      - Box Plots
+      - Scatter Plots
+  - Inferential Statistics
+    - Hypothesis Testing
+      - Null Hypothesis
+      - Alternative Hypothesis
+      - p-value
+    - Confidence Intervals
+      - Margin of Error
+      - Confidence Level
+    - Regression Analysis
+      - Linear Regression
+      - Multiple Regression
+      - Logistic Regression
+  - Probability
+    - Probability Distributions
+      - Normal Distribution
+      - Binomial Distribution
+      - Poisson Distribution
+    - Bayes' Theorem
+      - Prior Probability
+      - Likelihood
+      - Posterior Probability
\ No newline at end of file
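To make the Bayes' Theorem entries above (prior, likelihood, posterior) concrete, here is a small worked example; all the numbers are made up for illustration:

```python
# Hypothetical diagnostic-test example of Bayes' theorem:
# the condition affects 1% of the population, and the test has
# 99% sensitivity and a 5% false-positive rate.
prior = 0.01                      # P(condition)
sensitivity = 0.99                # P(positive | condition)  — the likelihood
false_positive = 0.05             # P(positive | no condition)

# Total probability of a positive result (the evidence).
evidence = sensitivity * prior + false_positive * (1 - prior)

# Bayes' theorem: posterior = likelihood * prior / evidence.
posterior = sensitivity * prior / evidence   # P(condition | positive) ≈ 0.167
```

Despite the accurate-sounding test, a positive result only raises the probability of the condition to about 17%, because the condition is rare to begin with.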