Our methodology

How we made sense of 3 million charges over 19 years of cases

This data visualization is part of The Circuit, a collaborative effort led by two nonprofit, nonpartisan journalism organizations, the Better Government Association and Injustice Watch, in partnership with civic tech consultants DataMade, and initially supported by The Chicago Reporter and the Center for Survey Methodology at the University of Chicago’s Harris School of Public Policy. It allows readers to explore large-scale trends in how many and what types of criminal cases were brought in the Cook County Circuit Court from 2000 through 2018. Charts also include data on felony and misdemeanor cases broken down by month and year. For a full overview of the project, see our project introduction.

These records were scraped between April and August 2019 from the Cook County Clerk of the Circuit Court’s mainframe information system. We wrote a computer program to automatically access and record the case dockets and case management information for criminal cases. The records were later processed and loaded into PostgreSQL, an open-source relational database program.

The charges of a case are entered into the clerk’s system and include a description of the charge; whether the charge is a felony, misdemeanor or local ordinance violation; the class of the charge, which indicates its level of severity; and a reference to which Illinois statute makes the act a crime.

This visualization displays the types of charges (misdemeanor or felony) and the dates cases were filed, based on the clerk’s data. We did not attempt to fix apparent data entry errors, such as the rare instances in which homicide cases were marked as misdemeanors. That means some charges could be attributed to the wrong year or misclassified as a misdemeanor or felony if a clerk made a data entry error.

In criminal cases, a person often faces multiple charges, so to avoid double counting we used only the top charge, meaning the first-listed charge. For example, if a defendant has murder as the top charge, we labeled that as a murder case even if the defendant also faced arson or robbery charges in the same case. The top charge is usually the most serious charge, and this method of characterizing cases is common in the criminological literature.
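The top-charge rule can be sketched in a few lines of Python. This is only an illustration: the row layout, the `charge_order` field and the sample case IDs are assumptions made for the example, not the structure of the clerk's data.

```python
# Hypothetical rows: (case_id, charge_order, description).
# The field names and values are assumptions for illustration only.
charges = [
    ("A-1", 2, "ARSON"),
    ("A-1", 1, "MURDER"),
    ("A-1", 3, "ROBBERY"),
    ("B-7", 1, "RETAIL THEFT"),
]


def top_charges(rows):
    """Return {case_id: description}, keeping only the first-listed charge."""
    best = {}
    for case_id, order, description in rows:
        # Keep the charge with the lowest listing order seen so far.
        if case_id not in best or order < best[case_id][0]:
            best[case_id] = (order, description)
    return {case_id: desc for case_id, (_, desc) in best.items()}


case_labels = top_charges(charges)
print(case_labels)
```

Each case contributes exactly one label, so a defendant facing murder, arson and robbery charges in one case is counted once, as a murder case.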

To get an accurate count of cases, we had to standardize the charge information. Most of the charges in the clerk’s data system had a statute citation, which references the chapter, section and subsection of the Illinois Compiled Statutes that defines the crime. However, the same citation could be entered in many different ways. To match the citations in the clerk’s data to this list of offenses from the Illinois State Police, we built a custom parsing tool that breaks down a citation into a standard form of chapter, section and subsection. The tool uses a statistical model called conditional random fields and was implemented in the programming language Python. Using this approach, we were able to match 84% of charges.
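To give a sense of what standardizing a citation involves, here is a deliberately simplified sketch that uses a regular expression in place of the conditional random fields model described above. It handles only a few hypothetical spellings of one citation. The pattern, the part names (`chapter`, `act`, `section`, following the common "chapter ILCS act/section" citation style) and the output format are all assumptions, not the project's parser.

```python
import re

# Simplified stand-in for the project's statistical parser: a regex that
# accepts a few assumed spellings such as "720 ILCS 5/21-1.2".
CITATION = re.compile(
    r"""^\s*
        (?P<chapter>\d+)           # leading number, e.g. 720
        \s*(?:ILCS|-)?\s*          # optional ILCS or hyphen separator
        (?P<act>\d+)(?:\.0)?       # second number, e.g. 5 or 5.0
        \s*[/\-]\s*                # slash or hyphen before the last part
        (?P<section>[\dA-Z.\-]+?)  # remainder, e.g. 21-1.2
        \s*$""",
    re.IGNORECASE | re.VERBOSE,
)


def normalize(citation):
    """Return the citation in one standard spelling, or None if unparsed."""
    match = CITATION.match(citation)
    if match is None:
        return None
    return "{}-{}/{}".format(
        match["chapter"], match["act"], match["section"].upper()
    )


variants = ["720-5/21-1.2", "720 ILCS 5/21-1.2", " 720 ilcs 5.0/21-1.2 "]
normalized = {normalize(v) for v in variants}
print(normalized)
```

A fixed pattern like this breaks as soon as clerks enter a citation in an unanticipated way, which is one reason a statistical parser can be a better fit for free-form entries.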

To match the remaining cases, we used dedupe, a record-matching program, to connect charge information from the clerk’s system with this official list. When we got a match, we used the Illinois State Police’s description as the standard description of the charge.

The standardization is not perfect, but it is good enough to reveal the large-scale trends that we are showing in this visualization. When we reviewed the matching of the clerk’s recorded charges to the Illinois State Police’s list of offenses, we had an accuracy of 92%.

To calculate this accuracy rate, we took a random sample of 100 distinct charges from the clerk’s system that we matched with our first method and hand-checked the matches. For 99% of the charges, this method matched the charge to the right offense. We then took a random sample of 100 distinct charges that the dedupe program proposed matches for. That sample had 79% accuracy. Finally, we took a random sample of 100 records that dedupe could not match, and this sample had 40% accuracy. A failure to match can be accurate because some charges are violations of local laws, not state laws, so they don’t have a corresponding statute citation. We then calculated the overall accuracy by combining the estimated group accuracy and weighting for the total number of records in each group.
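The weighting step can be illustrated with a short calculation. The three sample accuracies (99%, 79% and 40%) and the 84% share for the first method come from the text above. The split of the remaining 16% between the other two groups is not stated here, so the 6% and 10% shares below are placeholder assumptions chosen only to show the arithmetic.

```python
def overall_accuracy(groups):
    """Weighted mean of group accuracies; groups is [(share, accuracy), ...]."""
    total_share = sum(share for share, _ in groups)
    return sum(share * accuracy for share, accuracy in groups) / total_share


groups = [
    (0.84, 0.99),  # matched by the citation parser (share from the text)
    (0.06, 0.79),  # matched by dedupe (share is an ASSUMED placeholder)
    (0.10, 0.40),  # left unmatched (share is an ASSUMED placeholder)
]

estimate = overall_accuracy(groups)
print(round(estimate, 3))
```

Because the first group is both the largest and the most accurate, it dominates the combined estimate, even though the unmatched group has much lower accuracy.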

Once we had standardized charges, we linked them to Uniform Crime Reporting offenses. The UCR program, maintained by the FBI, involves nearly 18,000 law enforcement agencies of various types voluntarily reporting data on crimes. To keep reports consistent across jurisdictions, the FBI groups specific charges and crimes into categories so that they can be compared across states.

We used documents from the Illinois State Police to link standardized charges to UCR offenses. Unfortunately, there is sometimes ambiguity in the UCR mapping. For example, the statute citation of 720-5/21-1.2 maps to the UCR offense of "Criminal Damage to Property under $150 by means of fire or explosive" and also to the UCR offense "Institutional Vandalism."

If we had the arrest report or another source of details about the charged offense, we could tell whether the alleged crime included a fire or explosive. However, with the information in the clerk’s data, we cannot. In such cases, we have assigned the charge to the broader UCR offense, in this case, "Institutional Vandalism."

Additionally, the state legislature sometimes revises statutes, which changes the statutes mapped to a UCR category. To account for changes to the liquor control act and to driving under the influence statutes, we had to widen two categories, using the wider UCR offenses for "Driving Under the Influence of Alcohol" and "Illegal Possession of Alcohol by Minor."

Once we had a method for counting cases by charge, we needed to decide which cases to count. A person who is arrested for a felony typically has two cases associated with the arrest. They start with a case in a municipal division, where a judge holds a preliminary hearing of the police’s criminal complaint. At this hearing, the judge reviews the details of the arrest to determine whether the police had probable cause for the search and arrest and whether there is sufficient evidence to move forward. If the judge decides the case can move forward, the state’s attorney’s office will usually file its own charges, which will start a new case in the criminal division.

We needed to avoid double counting felony cases that have a municipal division and criminal division phase. To do this, if a criminal division case and municipal division case are associated with the same central booking number, a number unique to a single arrest, we only counted the criminal division case and did not count the municipal division case.
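The rule above can be sketched as a small filter. The record layout, field names and case numbers here are hypothetical, and the handling of cases with no booking number (they are simply kept) is an assumption of this sketch.

```python
# Hypothetical case records; field names and values are assumptions.
cases = [
    {"case_number": "M-100", "division": "municipal", "cb_number": "CB1"},
    {"case_number": "CR-200", "division": "criminal", "cb_number": "CB1"},
    {"case_number": "M-101", "division": "municipal", "cb_number": "CB2"},
    {"case_number": "CR-201", "division": "criminal", "cb_number": None},
]


def countable_cases(rows):
    """Drop municipal cases whose booking number also has a criminal case."""
    criminal_cbs = {
        r["cb_number"]
        for r in rows
        if r["division"] == "criminal" and r["cb_number"]
    }
    return [
        r
        for r in rows
        if not (r["division"] == "municipal" and r["cb_number"] in criminal_cbs)
    ]


kept = [r["case_number"] for r in countable_cases(cases)]
print(kept)
```

In this example the arrest with booking number CB1 is counted once, through its criminal division case, while the municipal case for CB2 is kept because no criminal division case shares its booking number.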

We further filtered the charges by only considering cases where the top charge is a felony or misdemeanor. This filters out cases in which the top offense is a quasi-criminal local ordinance violation, where a citation or ticket is issued, such as parking violations, public intoxication or noise violations. For rare cases where a top charge is amended and therefore not listed as a felony or misdemeanor, we calculated the most common type of each charge. For amended charges otherwise recorded as a felony or misdemeanor in over 50% of occurrences in our data, we considered that charge to be its most common type.
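One possible reading of the "most common type" rule is sketched below: a charge gets a type only if a single type, felony or misdemeanor, accounts for more than 50% of its recorded occurrences. That reading, the function name and the sample data are assumptions, so the actual rule may differ in its details.

```python
from collections import Counter, defaultdict


def majority_types(observations, threshold=0.5):
    """Map each charge to its dominant type when that type exceeds threshold.

    observations is an iterable of (charge, charge_type) pairs; pairs with a
    missing type (None) are ignored when counting.
    """
    counts = defaultdict(Counter)
    for charge, charge_type in observations:
        if charge_type is not None:
            counts[charge][charge_type] += 1

    result = {}
    for charge, counter in counts.items():
        top_type, top_count = counter.most_common(1)[0]
        share = top_count / sum(counter.values())
        if top_type in {"felony", "misdemeanor"} and share > threshold:
            result[charge] = top_type
    return result


# Hypothetical observations for illustration only.
observations = [
    ("RETAIL THEFT", "misdemeanor"),
    ("RETAIL THEFT", "misdemeanor"),
    ("RETAIL THEFT", "misdemeanor"),
    ("RETAIL THEFT", "felony"),
    ("RETAIL THEFT", None),  # an amended charge with no recorded type
    ("EVENLY SPLIT CHARGE", "felony"),
    ("EVENLY SPLIT CHARGE", "misdemeanor"),
]

inferred = majority_types(observations)
print(inferred)
```

The evenly split charge gets no inferred type because neither type exceeds the 50% threshold.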

All data analysis for these visualizations is handled in pure SQL orchestrated by a series of Makefiles. We built a separate, reproducible analysis schema in the remote database with all calculations needed to power the visualizations, such as charge and category counts by year and month. The database is connected to our Gatsby frontend through Hasura, and data is pulled into the static site with GraphQL queries.
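A count by year and month of the kind described above might look like the following. This sketch runs the SQL through Python's built-in `sqlite3` module so that it is self-contained. The project uses PostgreSQL, and the table name, column names and rows here are hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the project's PostgreSQL schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cases (case_number TEXT, filed_date TEXT, charge_type TEXT)"
)
conn.executemany(
    "INSERT INTO cases VALUES (?, ?, ?)",
    [
        ("A", "2018-01-05", "felony"),
        ("B", "2018-01-20", "misdemeanor"),
        ("C", "2018-02-02", "felony"),
        ("D", "2018-01-11", "felony"),
    ],
)

# Count cases per year, month and charge type.
rows = conn.execute(
    """
    SELECT strftime('%Y', filed_date) AS year,
           strftime('%m', filed_date) AS month,
           charge_type,
           COUNT(*) AS n
    FROM cases
    GROUP BY year, month, charge_type
    ORDER BY year, month, charge_type
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
```

In PostgreSQL the date parts would more likely come from `date_trunc` or `EXTRACT` rather than `strftime`, but the grouping logic is the same.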

Methodology for Race, Ethnicity and Gender

We also had to do extensive work to match defendants in Cook County to their race, ethnicity, and gender.

While the Cook County Circuit Court data we obtained in 2019 contained the race and ethnicity of some defendants, we found the court data was, in many cases, either missing or inaccurate. Most notably, we discovered that the courts changed how they categorized Latinx defendants. The courts at one point identified Latinx defendants as "Spanish American" and used other labels as well. Then in 2009, those labels for Latinx defendants all but disappeared. The reasons why remain unclear.

More than 90% of the cases either have race, ethnicity, and gender information in the court data or can be matched to another record that does. To resolve the problem of the missing and inaccurate data, we sought race and ethnicity data from the Cook County Sheriff's Office’s booking system, which collects race and ethnic information for defendants after arrests. We linked the race, ethnicity, and gender data in the booking system to the same defendants we found in the court’s records, enabling us to fill in and correct the data. This data, along with some other data cleanup measures, gives us a more accurate estimate of racial, ethnic, and gender breakdowns overall and allows us to assess how racial, ethnic, and gender groups are charged with crimes.