Summary
Over the last few years, we’ve noticed an increase in the complexity of spam bots signing up to Magento 2 stores. They used to just spam customer accounts using the same email/email domain likely on ending in .qq.com
.ru
etc.
So it used to be easy to just bulk delete based on the TLD especially for sites that do no ship to China / Russia etc.
Whereas the last few years I’m observing them using standard mail providers such as gmail.com
or outlook.com
along with name fakers for the account names. This coupled with slower registration rate, and realistic fake data in the other fields make it much hard to batch delete customers.
There is a few ways we can still try to identify bot registrations though. They will require some human review to make sure we are not deleting genuine customers, and still might miss some accounts. But it should sort the vast majority of the accounts.
The SQL queries include joins against the customer_log
and sales_order
tables to help identify genuine accounts from spam ones. We use these tables to gather if the customer has placed any orders, and when their last login was.
Advertising
One reason for spam accounts being registered is them attempting to use your domain to advertise to email addresses to avoid spam blacklists. A good monitoring solution is likely to highlight this within your alerting/dashboards.
One way to try and identify these accounts in the database is searching the firstname
and lastname
columns for potential domains. As these fields usually will appear in the customer registration email, they are often targeted.
SELECT
e.entity_id, email, firstname, lastname, e.created_at, e.updated_at, l.last_login_at, o.cnt
FROM
customer_entity as e
LEFT JOIN
customer_log as l ON l.customer_id = e.entity_id
LEFT JOIN
(SELECT count(*) as cnt, customer_id FROM sales_order GROUP BY customer_id) as o on o.customer_id = e.entity_id
WHERE
firstname RLIKE '.*www.*|\\\\.(net|com|ru|co|cn|cz|uk)'
OR lastname RLIKE '.*www.*|\\\\.(net|com|ru|co|cn|cz|uk)'
GROUP BY e.entity_id
Advanced Email Addresses
It is fairly uncommon for most customers to include a + sign within their email address. Most internet uses does not even know this is an option. You will need to manually review each line, since a handful of these could potentially be legitimate customers but most will be bots.
SELECT
e.entity_id, email, firstname, lastname, e.created_at, e.updated_at, l.last_login_at, o.cnt
FROM
customer_entity as e
LEFT JOIN
customer_log as l
ON l.customer_id = e.entity_id
LEFT JOIN
(SELECT count(*) as cnt, customer_id FROM sales_order GROUP BY customer_id) as o on o.customer_id = e.entity_id
WHERE
email LIKE '%+%'
GROUP BY e.entity_id
Random Name Fields
Another signature of bot accounts I’ve observed recently is randomly generated strings similar to TGoHfngNexaUju
or kaYgQKXpOlURPSnI
within the name fields. These prove to be much more of a challenge to identify and bulk remove. Currently the best query I have found these is by filtering the name fields on the following conditions:
- Does not contain a space
- Length is more than 5 characters
- Is not full capitals
- If capitals make up 40% of the string
SELECT
e.entity_id, email, firstname, lastname, e.created_at, e.updated_at, l.last_login_at, o.cnt
FROM
customer_entity as e
LEFT JOIN
customer_log as l ON l.customer_id = e.entity_id
LEFT JOIN
(SELECT count(*) as cnt, customer_id FROM sales_order GROUP BY customer_id) as o on o.customer_id = e.entity_id
WHERE
(
firstname NOT LIKE '% %'
AND LENGTH(firstname) > 5
AND CAST(UPPER(firstname) as BINARY) != CAST(firstname as BINARY)
AND (LENGTH(REGEXP_REPLACE(firstname, '(?-i)[A-Z]', '')) / LENGTH(firstname)) < 0.6
) OR (
lastname NOT LIKE '% %'
AND LENGTH(lastname) > 5
AND CAST(UPPER(lastname) as BINARY) != CAST(lastname as BINARY)
AND (LENGTH(REGEXP_REPLACE(lastname, '(?-i)[A-Z]', '')) / LENGTH(lastname)) < 0.6
)
GROUP BY e.entity_id