Data Exploration

Abstract: This project explores patterns and causes of flight delays in the United States using multiple government and open data sources. The primary goal is to examine how flight delay trends have changed from the pre-COVID baseline (2019) through the post-COVID years (2022–2024), with a focus on specific airports and airlines. The analysis compares airport- and airline-specific performance, seasonal delay trends, and contributing factors such as weather and air-traffic volume.

Data Acquisition

Data Collection: To collect flight delay information, we used data from the Bureau of Transportation Statistics (BTS). The data come directly from the federal government and cover flights back to 1987. Much of it is aggregated by year or month, which is useful for exploration and summary but too coarse for predicting future flight delays. We therefore used the Detailed Statistics Departures tool, which generates flight-level records for a given set of days. The tool accepts up to 31 days of input for a given year, departure airport, and airline. It then returns every matching flight with its takeoff and departure times and any delay, along with the reason for each delay and the time lost to it. As a result, this dataset addresses almost all of our research questions. It lets us analyze delay reasons, and it supports prediction from features such as takeoff time, departure airport, destination airport, airline, and the other factors covered in our research questions. For now we have not needed to merge in any other data sources. We did, however, have to combine the output of multiple queries: because each query returns results only for certain days at one airline and one airport, we varied these factors to build our initial dataset. We first chose 20 random days throughout the year, covering different months and different weekdays, to capture a variety of traffic patterns.
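The step of combining many per-query exports into one dataset can be sketched as follows. This is a minimal illustration, not the project's actual script: the column names (`flight_number`, `departure_delay`), the added `origin_airport` and `carrier` fields, and the in-memory sample data are all hypothetical stand-ins for the real exported files.

```python
import csv
import io


def combine_exports(exports):
    """Merge per-query CSV exports into one list of row dicts.

    `exports` maps (airport, airline) -> an open text file (or any
    file-like object) holding one CSV export from the query tool.
    """
    rows = []
    for (airport, airline), handle in exports.items():
        for row in csv.DictReader(handle):
            # Tag each flight with the query it came from, in case the
            # export for a single query does not repeat these fields.
            row["origin_airport"] = airport
            row["carrier"] = airline
            rows.append(row)
    return rows


# Tiny in-memory stand-ins for two exported files (made-up values).
sample_exports = {
    ("DEN", "UA"): io.StringIO("flight_number,departure_delay\n100,12\n101,0\n"),
    ("MCI", "WN"): io.StringIO("flight_number,departure_delay\n200,45\n"),
}

combined = combine_exports(sample_exports)
print(len(combined), "flights combined")
```

With real exports, the `io.StringIO` objects would be replaced by files opened with `open(path, newline="")`. Tagging rows at merge time keeps the airport and airline available even when they appear only in the file name or query settings.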

Data Sources & Descriptions: Several data sources were evaluated for this project, each offering distinct strengths and limitations. The FAA Operations & Performance Data provides comprehensive historical traffic counts and delay statistics, making it a reliable source for analyzing U.S. air traffic performance, though access may require a Login.gov account and could be affected by federal transitions. The Operations Network (OPSNET) has served as the FAA’s official database for air traffic operations since 1990, allowing for analysis of peak travel periods but lacking detail on delay causes and flight-level timing. FlightAware offers real-time and historical flight data across 45 countries but was excluded due to its costly subscription requirements. The Bureau of Transportation Statistics (BTS) provides public access to detailed delay causes and trends, though it is limited to static downloads and has no real-time functionality. The OpenSky Network delivers global flight tracking data useful for modeling and trend analysis but focuses more on aircraft movement than on airport-specific delays. The National Airspace System (NAS) Status site gives real-time insight into national disruptions affecting air traffic, yet it lacks historical and flight-level data. Lastly, FlightLabs provides an API for real-time flight information specific to airports such as Denver International, but much of its data overlaps with other, more comprehensive sources, so it was not prioritized for this project.

Method of Collection: Data were collected using publicly available APIs and query tools provided by the FAA and BTS. For the BTS On-Time Performance dataset, a fixed sample of 24 days (the 12th and 24th of each month in 2024) was selected to capture both peak travel days and regular operations. The dataset covers four airports chosen to represent a range of sizes and geographic regions: LaGuardia (LGA), a large, high-density urban airport; Denver International (DEN), a major national hub; Kansas City International (MCI), a mid-sized regional airport; and Chicago O’Hare (ORD), one of the nation’s largest and busiest hubs. Nine major airlines were included: American, Delta, Southwest, United, JetBlue, Spirit, Frontier, SkyWest, and Alaska. For each airport and airline combination, flight-level data were retrieved through the BTS query interface and exported as CSV files.
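To make the size of the collection task concrete, the sampling scheme described above can be enumerated in a few lines. This sketch only lists the days and the airport and airline combinations to query; it does not call the BTS interface, and the function names are illustrative.

```python
from datetime import date
from itertools import product

AIRPORTS = ["LGA", "DEN", "MCI", "ORD"]
AIRLINES = [
    "American", "Delta", "Southwest", "United", "JetBlue",
    "Spirit", "Frontier", "SkyWest", "Alaska",
]


def sample_days(year=2024, days_of_month=(12, 24)):
    """Return the sampled dates: the given days of every month in `year`."""
    return [
        date(year, month, day)
        for month in range(1, 13)
        for day in days_of_month
    ]


def query_grid(airports, airlines):
    """Return every (airport, airline) pair that needs its own query."""
    return list(product(airports, airlines))


days = sample_days()
grid = query_grid(AIRPORTS, AIRLINES)
print(len(days), "sample days;", len(grid), "airport-airline queries")
# prints "24 sample days; 36 airport-airline queries"
```

Four airports and nine airlines yield 36 combinations, which is why the exports had to be gathered through repeated queries and merged afterwards.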

Relevance to Research Questions: The datasets selected for this project are relevant to the research questions because they capture the multiple dimensions of flight delays that influence air travel performance. The FAA’s Operations & Performance Data and Aviation System Performance Metrics (ASPM) provide historical and real-time statistics on flight operations and delay patterns, making them essential for analyzing national trends and assessing how delays have evolved since the COVID-19 pandemic. The FAA Operations Network (OPSNET) complements this by offering daily airport-level data, which supports temporal and seasonal analyses such as whether peak travel periods or higher traffic volumes contribute to increased delays. FlightAware adds a valuable layer of real-time operational data sourced from global air-navigation providers, allowing for cross-validation of FAA records and potential incorporation into short-term delay prediction efforts. The Bureau of Transportation Statistics (BTS) dataset serves as the project’s foundation for flight-specific delay analysis, offering detailed records on delay causes, airline performance, and airport variation. The OpenSky Network expands this scope by providing high-frequency, real-time flight-tracking data that can inform predictive modeling and spatial analysis of flight movements. Finally, the National Airspace System (NAS) Status dashboard contributes contextual insight into system-wide events such as facility outages or operational disruptions. Collectively, these datasets align closely with the project’s objectives to measure post-COVID delay trends, identify major contributing factors such as staffing, weather, or traffic volume, and evaluate the reliability and accuracy of current delay-prediction systems.

Before cleaning, the raw dataset contained only a few key flight metrics like departure delay, scheduled elapsed time, and actual elapsed time. The structure was minimal, and several records had incomplete or zero values. At this stage, the data primarily reflected timing differences without context for the underlying causes of delays or the performance of different flights. After cleaning and transformation, the dataset was enriched with additional computed and categorical features, including Primary Delay Reason, Primary Delay Percentage, Number of Delay Reasons, In-Flight Delay, and Total Delay. These new fields standardized delay categories and quantified their contribution to total delay times, enabling more robust descriptive and comparative analyses across flights and airlines.
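One way such derived fields could be computed is sketched below. The definitions are assumptions made for illustration, since the text names the fields but not their formulas: the primary reason is taken as the cause with the most delay minutes, its percentage as that cause's share of all attributed minutes, in-flight delay as actual minus scheduled elapsed time, and total delay as departure delay plus in-flight delay. The input keys are hypothetical column names.

```python
# Hypothetical keys for the per-cause delay minutes in one flight record.
CAUSES = ["carrier", "weather", "nas", "security", "late_aircraft"]


def derive_delay_features(flight):
    """Compute the derived delay fields for one flight record (a dict)."""
    cause_minutes = {cause: flight.get(cause, 0) for cause in CAUSES}
    attributed = sum(cause_minutes.values())
    active = [cause for cause, minutes in cause_minutes.items() if minutes > 0]

    if active:
        # Assumption: the primary reason is the cause with the most minutes.
        primary = max(active, key=cause_minutes.get)
        primary_pct = 100.0 * cause_minutes[primary] / attributed
    else:
        primary, primary_pct = None, 0.0

    # Assumption: in-flight delay is time gained or lost relative to schedule.
    in_flight = flight["actual_elapsed"] - flight["scheduled_elapsed"]

    return {
        "primary_delay_reason": primary,
        "primary_delay_pct": primary_pct,
        "num_delay_reasons": len(active),
        "in_flight_delay": in_flight,
        "total_delay": flight["departure_delay"] + in_flight,
    }


example = {
    "departure_delay": 40,
    "scheduled_elapsed": 120,
    "actual_elapsed": 130,
    "carrier": 30,
    "weather": 10,
}
features = derive_delay_features(example)
print(features)
```

For the made-up example record, the carrier cause accounts for 30 of 40 attributed minutes, so it is the primary reason at 75 percent, with two delay reasons, 10 minutes of in-flight delay, and 50 minutes of total delay.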

Visualization (Pre-Cleaning): **Figure 1.** Percentage of Delayed Flights by Airline (ORD, MCI, DEN, LGA – 2024 20-Day Sample).

Visualization (Post-Cleaning): **Figure 2.** Percentage of Delayed Flights by Airline (ORD, MCI, DEN, LGA – 2024 20-Day Sample).
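The metric plotted in these figures, the share of delayed flights per airline, might be computed along the following lines. The 15-minute cutoff follows the common U.S. DOT convention for counting a flight as delayed; whether the figures use that threshold is an assumption, as are the field names and the sample records.

```python
from collections import defaultdict


def pct_delayed_by_airline(flights, threshold_minutes=15):
    """Return {carrier: percentage of flights delayed at least the threshold}."""
    totals = defaultdict(int)
    delayed = defaultdict(int)
    for flight in flights:
        carrier = flight["carrier"]
        totals[carrier] += 1
        if flight["departure_delay"] >= threshold_minutes:
            delayed[carrier] += 1
    return {carrier: 100.0 * delayed[carrier] / totals[carrier] for carrier in totals}


# Made-up records for illustration only.
flights = [
    {"carrier": "UA", "departure_delay": 20},
    {"carrier": "UA", "departure_delay": 5},
    {"carrier": "WN", "departure_delay": 0},
    {"carrier": "WN", "departure_delay": 16},
    {"carrier": "WN", "departure_delay": 90},
    {"carrier": "WN", "departure_delay": 30},
]
shares = pct_delayed_by_airline(flights)
print(shares)
```

Reporting a percentage rather than a raw count keeps airlines with very different flight volumes comparable on the same axis.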

Data Mining Project Website
