In October of last year, a ruling against LinkedIn by the United States Court of Appeals for the Ninth Circuit in San Francisco left many confused. How could the court rule in favor of a company, HiQ Labs, that used bots to scrape over 150 million professionals’ LinkedIn usernames, email addresses and phone numbers without authorization? Why would the court rule in support of bots? Well, that question may have been re-answered in early November of this year (yes, just a month ago). The Court of Appeals, in light of the United States Supreme Court’s recent ruling in Van Buren v. United States, a case involving the Computer Fraud and Abuse Act (CFAA), reversed the year-old ruling that favored HiQ. It determined that LinkedIn may enforce its User Agreement against data scraping after all.
It seems like the perfect time to review what has been an interesting case related to data scraping, the scope of user data collection and its subsequent sale. It’s important to note that not all data buyers have nefarious intentions. Some are legitimate. Hopefully, this review will serve to educate and help organizations prevent private data related to employees, customers and partners from being scraped and sold to threat actors looking to make a quick buck from unsuspecting victims.
How An Interesting, Game-Changing Court Case Began
This case started when HR company HiQ Labs sued LinkedIn in 2017 after the latter sent cease-and-desist letters to restrict HiQ’s access to LinkedIn’s website. Since 2015, HiQ had deployed bots to scrape public LinkedIn profiles. It wanted to gain insights into employee attrition, which it then provided to its customers. HiQ hadn’t operated illegally, but the idea that it scraped LinkedIn’s website didn’t sit well with LinkedIn, even though the information is public. The fact that HiQ’s scraping rubbed LinkedIn the wrong way is understandable; after all, aren’t there rogue scrapers who sell information on the dark web? Yes, there are, but HiQ isn’t one of them. Unfortunately, though, information has a way of getting into the wrong hands.
In April 2021, hackers sold a file on the dark web that included information from 150 million LinkedIn users. Shortly after, another cache of previously scraped LinkedIn data appeared for sale on another dark web forum. The data included personal information, email and physical addresses, names, phone numbers, professional titles and work-related data on approximately 500 million LinkedIn users (yes, as in half a billion). It wasn’t the first time LinkedIn data had been compromised, though. In 2012, a data breach exposed the email addresses and passwords of 117 million LinkedIn users. At the time, it was one of the largest corporate data breaches to date.
Only two months later, on June 22nd, a hacker announced the sale of LinkedIn user data that encompassed 700 million members on an underground dark net forum. If you’re keeping score, that means over 92% of the roughly 756 million members LinkedIn reported at the time had their data scraped and aggregated with personal information from other sources. The data was available to anyone interested in buying the entire file or smaller chunks broken out by region and/or demographics. As is often the case, payment was due in Bitcoin or other cryptocurrencies to better hide the identities of the seller and buyers.
Cross-referenced Data Leads to More Sales and More Attacks
In the underground market for personal data, some sellers claim to verify or cross-reference various pieces of data to guarantee accuracy. It’s a common practice for usernames and passwords as well as for credit and payment card data. Reviews of the archived LinkedIn data strongly suggest that data from multiple other sources was combined with it. This, of course, enhanced its value to cybercriminals and accelerated its abuse.
To demonstrate the authenticity of the scraped data, the sellers provided approximately a million LinkedIn data records from the most recent leak at no charge to researchers who contacted them. The researchers discovered that full names, email addresses, phone numbers and other data could be correlated with LinkedIn users. This vast amount of user data is likely to become another valuable resource for cybercriminals to carry out phishing attacks, financial fraud, account takeovers, impersonation, and other forms of targeted attacks.
But LinkedIn isn’t the only social media platform that has been targeted. In April of 2021, 533 million Facebook users had their personal data, including names, email IDs, phone numbers, birthdays, and other information, hacked and distributed via underground sites that specialize in buying and selling PII (Personally Identifiable Information).
Public or Private? LinkedIn Users Make That Call
LinkedIn users have the option of making their profile information public or private. HiQ only scraped the public profiles, and data protection regulations only apply to private profiles, not to the publicly visible ones. The initial court ruling in 2019 also prohibited LinkedIn from blocking HiQ’s systematic scraping campaign during the litigation process. Doing so, the court ruled, would interfere with customer contracts HiQ had in place. Those customers rely on the scraped data.
According to the privacy research and product review website that broke the news of the LinkedIn data leak, the party (or parties) that posted the scraped data archive claimed to have obtained it by exploiting an official LinkedIn API (application programming interface). The big shocker: the asking price was only $5,000 USD!
While only a small percentage of LinkedIn users provide more than the basic information required to open an account, users who provide only the basics can still be at risk. The archived data (even a minimal amount) contained data sourced from other databases. That could mean LinkedIn accounts could be hacked, and that email addresses, phone numbers and other information targeted by spammers and robocallers could be obtained. If the same email addresses and passwords are used on other websites, owners are all but inviting credential stuffing attacks in which bot masters validate usernames and passwords on targeted websites and applications. With enough personal data, fraudsters can subject users (as well as their relatives and friends) to social engineering attacks that gain their trust and defraud victims in a variety of ways.
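To make the credential stuffing pattern concrete, here is a minimal Python sketch of one way a site might spot it: flagging any source whose failed logins span many distinct usernames. The function name, the threshold and the sample events are all hypothetical and chosen only for illustration. Real attacks are often spread across many sources, so a check like this is a starting point rather than a complete defense.

```python
from collections import defaultdict

def flag_credential_stuffing(failed_logins, min_distinct_users=5):
    """Return the sources whose failed logins cover many distinct usernames.

    failed_logins: iterable of (source_id, username) pairs, one per failed attempt.
    min_distinct_users: hypothetical threshold; tune it for real traffic.
    """
    users_by_source = defaultdict(set)
    for source_id, username in failed_logins:
        users_by_source[source_id].add(username)
    return sorted(
        source_id
        for source_id, users in users_by_source.items()
        if len(users) >= min_distinct_users
    )

# Illustrative data: a bot trying eight different leaked accounts,
# and a person mistyping the password for a single account three times.
events = [("bot-1", f"user{i}@example.com") for i in range(8)]
events += [("home-user", "alice@example.com")] * 3

flagged = flag_credential_stuffing(events)
print(flagged)
```

In this made-up sample only "bot-1" is flagged, because the repeated failures from "home-user" all target one account, which looks like a forgotten password rather than a list of stolen credentials being tested.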
What Does This Mean for Organizations Whose Data Can Be Scraped?
What many don’t realize is that web scraping was one of the earliest widespread uses of bots, and still is today. We’re in the era of big data. Most organizations collect, leverage and glean valuable insights from vast amounts of collected data. Unfortunately, this reliance on data means that data scraping will only grow. And that growth will open the door for more serious and regularly occurring threats. The precipitous growth in the use of Application Programming Interfaces (APIs) will facilitate this trend. APIs facilitate data transfers between web and mobile applications and the databases supporting them. The attacks that feed on scraped data are all illegal get-rich-quick schemes that entice victims into unknowingly opening doors and that exploit security vulnerabilities.
Even as security measures for data storage and transit keep getting better at protecting confidential data, bot technologies are getting better at mimicking humans. The latest 4th-generation bots learn and emulate how humans use websites and mobile applications. Unfortunately, they are extremely difficult to detect using conventional tools like web application firewalls (WAFs), access control lists (ACLs) and IP address reputation lists of known bad bot originators.
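A small Python sketch can show why conventional, per-address controls struggle here. It implements a simple sliding-window request limit per IP address, which is one common conventional technique (not any specific product's method), and then compares a single noisy scraper with the same number of requests spread across many addresses. The class name, the limits and the documentation-range IP addresses are assumptions made for illustration.

```python
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most max_requests per client IP within window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client IP -> timestamps of allowed requests

    def allow(self, client_ip, now):
        hits = self._hits[client_ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True

def count_blocked(limiter, requests):
    """Count how many (ip, timestamp) requests the limiter rejects."""
    return sum(1 for ip, ts in requests if not limiter.allow(ip, ts))

# 100 requests in about 10 seconds from one address.
noisy = [("203.0.113.7", i * 0.1) for i in range(100)]
# The same 100 requests, each sent from a different address.
distributed = [(f"198.51.100.{i}", i * 0.1) for i in range(100)]

noisy_blocked = count_blocked(SlidingWindowLimiter(20, 60), noisy)
distributed_blocked = count_blocked(SlidingWindowLimiter(20, 60), distributed)
print(noisy_blocked, distributed_blocked)
```

With these illustrative settings the single-address scraper has 80 of its 100 requests rejected, while the distributed one has none rejected, even though the total load on the site is identical. That gap is the reason address-based limits and reputation lists alone tend to miss bots that rotate addresses and pace themselves like people.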
Your Best Bet to Protect Against Bad Bots
The way organizations can reliably and effectively secure their data from scraping and other types of bot attacks is to implement a dedicated bot management solution. The right solution protects websites, mobile applications and APIs that communicate with internal and external services. To learn more about how to protect against the “bad” bots, contact the cybersecurity experts at Radware. They will be happy to show you how to prevent data scraping and other harmful bot activities from attacking your website, mobile applications and APIs, and from damaging your brand and bottom line. They would love to hear from you.