Author: Ankit Chauhan
A massive amount of data is employed in Big Data. We have two major issues when it comes to data. The first challenge is how to collect a significant amount of data, and the second is how to interpret the data that has been collected. You’ll need a messaging system to overcome these obstacles.
Kafka is a messaging system that is distributed and has high throughput. As a replacement for a more traditional message broker, Kafka performs admirably. Kafka has better throughput, built-in partitioning, replication, and intrinsic fault-tolerance than other messaging systems, making it an excellent choice for large-scale message processing applications.
What is Kafka?
Apache Kafka is a scalable queue and publish-subscribe messaging system that can manage large volumes of data and send messages from one endpoint to another. Kafka may be used to consume messages both offline and online. To prevent data loss, Kafka messages are stored on discs and replicated throughout the cluster.
The following are some of Kafka’s advantages. −
- Reliability− Kafka has distributed, partitioned, replicated, and fault-tolerant reliability.
- Scalability− The Kafka communications system scales up and down with ease.
- Durability− Kafka makes use of a Distributed Commit Log, which ensures that messages are stored on discs as quickly as possible, resulting in a long-lasting system.
- Performance− For both publishing and subscribing messages, Kafka has a high throughput. Even with many TB of communications saved, it retains consistent performance.
Kafka is extremely quick and ensures that there will be no downtime or data loss.
Need for Kafka
Kafka is a centralized platform for processing all real-time data sources. Kafka allows for low-latency message delivery while also ensuring fault tolerance in the event of machine failure. It can handle a large number of different types of customers. Kafka is extremely fast, with 2 million writes per second. Kafka writes all data to disc, which essentially means that all writes go to the operating system’s page cache (RAM). Transferring data from a page cache to a network socket becomes much faster as a result of this.
What is Confluent Kafka?
Confluent Platform is a feature-rich data streaming platform that allows you to access, store, and manage data in real-time streams with ease. Confluent extends the benefits of Apache Kafka® with enterprise-grade functionality while minimizing the burden of Kafka management and monitoring. The original Apache Kafka® creators created it. Today, more than 80% of Fortune 100 organizations use data streaming technology, with Confluent accounting for the majority of them.
Confluent is a more comprehensive Apache Kafka distribution. It improves Kafka’s integration capabilities by adding tools for optimizing and maintaining Kafka clusters and methods for ensuring the security of the streams. Because of the Confluent Platform, Kafka is simple to set up and use.
Confluent’s software comes in three ways: a free, open-source streaming platform that makes it simple to get started with real-time data streams; an enterprise-grade version with more administration, operations, and monitoring tools; and a premium cloud-based version.
Comparison between Confluent and Apache Kafka:
|Parameters||Confluent Kafka||Apache Kafka|
|Terminologies||Data Streaming||Pub-Sub Platform|
|Technology||Data processing||Data Operations|
|Applications||Dynamic Development Requirements||Diverse Workloads|
|Performance||Better than Apache||Less than confluent|
|Pros||Ease of Data management||Open Source and Durable|
|Cons||Cannot tweak it much||Doesn’t have all data process capabilities.|
|Pricing||$0.11 per GB||Free of cost, but needs to pay for the cloud charges.|
What is Snowflake?
- Snowflake Inc. is a data warehousing company situated in Bozeman, Montana that uses cloud computing. After two years in stealth mode, it was publicly released in October 2014 after being formed in July 2012. The name was chosen as a nod to the founders’ passion for winter sports.
- Snowflake provides “data warehouse-as-a-service,” a cloud-based data storage and analytics solution. Cloud-based technology and software enable corporate users to store and analyze data. By developing and perfecting a cloud-based data platform, the business is credited with rejuvenating the data warehouse industry. Before Google, Amazon, and Microsoft, it was able to isolate computer data storage from computing.
- On the Forbes Cloud 100, the company is ranked first.
Use cases for Snowflake as a data warehousing system include:
- Storage– Data storage in the cloud is more scalable and often less expensive than on-premise storage.
- Reporting– Your team will be able to execute more business reporting faster and on a greater scale, thanks to data warehouses. Moving data to the cloud also makes it easier to rearrange information so that it is more valuable and understandable to business users.
- Analytics- You can execute data analysis at any scale with Snowflake to get the insights you need. When you integrate it with your larger systems, you’ll be able to bring value to your operational business applications. Consider your customer relationship management (CRM) application.
Some specific benefits include:
- Modern security- To keep your data safe, take advantage of features like “always-on” encryption. Perfect for industries that deal with sensitive information.
- Advanced analytics- By allowing concurrent, safe, regulated access to data, you can democratize analytics and migrate to real-time data streams.
- High scalability- An almost infinite number of workloads can be spun up and down. You may now save as much as you want and tackle resource-intensive undertakings.
- Easy manageability- It’s simple to get started with a data warehouse like Snowflake, and you don’t need a large crew to manage your low-level infrastructure.
- Strong accessibility- You’ll be able to access your data storage systems from anywhere, 24/7.
Steps to connect Confluent Kafka to Snowflake:
- Create a new Snowflake Account on Snowflake Trial
- Remember the Region you selected. (Note: -The Snowflake account and the Kafka cluster(Cluster Created in Confluent Kafka) must be in the same region.)
- Enter the login credentials.
- Create and Login into a new confluent account on Try Confluent for Free: Get Started on Any Cloud in Minutes
- On the Create cluster page, pick Basic after clicking Add cluster.
- Click Begin configuration. The page Regions/Zones appears. Select a cloud provider, location, and availability zone from the drop-down menus. Click Continue.
Note- The Snowflake account and the Kafka cluster(Cluster Created in Confluent Kafka) must be in the same region.) in our case: eu-west-2 (London)
- Click on Continue.
- Enter the Payment Details (It is just for the verification).
- Choose a cluster name, then examine your parameters, cost, and consumption before clicking Launch cluster.
- Next Step is to generate a Snowflake key pair.
A key pair must be generated before the connector can sink data to Snowflake. 2048-bit encryption is required for snowflake authentication (minimum) RSA. A Snowflake user account is used to store the public key. The Private key is added to the connector configuration.
- Download and extract openssl(Latest Version) for key pair generation – Google Code Archive – Long-term storage for Google Code Project Hosting.
- Open command prompt and change directory to where openssl is extracted.
- Example – Cd C:\MyFiles\Kafka-Snowflake\openssl-0.9.8k_WIN32\bin
- Generate a private key using the below OpenSSL command.
openssl genrsa -out snowflake_key.pem 2048
- Generate the public key referencing using the below private key command.
openssl rsa -in snowflake_key.pem -pubout -out snowflake_key.pub
- Two new files will be created in the Bin directory. (Private and Public files containing the keys will be generated in the bin folder with the details.)
- Once you have completed all these steps, go back to your newly created snowflake account (Worksheet).
- Our next step is to create a New User and assign a Public Key to it. You can do that by executing the below code and replacing the <public-key> with the “Public Key” that you recently generated.
Note – Be careful to include the public key in the statement as a single line.
User Name- confluent_user
USE ROLE securityadmin;
CREATE or REPLACE USER confluent_user RSA_PUBLIC_KEY='<public-key>’;
- Configuring user privileges
Complete the procedures below to assign the appropriate privileges to the newly minted user.
GRANT ROLE SYSADMIN to user confluent_user;
–Make use of a role that has the ability to establish and manage roles and privileges:
use role securityadmin;
–Create a Snowflake role that has the ability to interact with the connector.
create or replace role kafka_conn_role;
USE ROLE accountadmin;
— Create a new database
create or replace database kafka_db;
— Create a new schema
create or replace schema kafka_schema;
— Grant privileges on the database
grant usage on database kafka_db to role kafka_conn_role;
–– Grant usage on database kafka_db to role accountadmin;
use database kafka_db;
use schema “KAFKA_DB”.”KAFKA_SCHEMA”;
— Grant privileges on the schema
grant usage on schema kafka_schema to role kafka_conn_role;
grant create table on schema kafka_schema to role kafka_conn_role;
grant create a stage on schema kafka_schema to role kafka_conn_role;
grant create a pipe on schema kafka_schema to role kafka_conn_role;
— Give an existing user the custom role.
grant role kafka_connector_role to user confluent_user;
— Set the new role as the default:
alter user confluent_user set default_role=kafka_conn_role;
- Create a KAFKA TOPIC
- Open Cluster in the navigation bar on the left, then Topics, and then Create Topics on the Topics page.
- In the Topic name field, name the topic of your choice (In our case I’ll name it kafka_snowflake_topic). Select Create with Defaults.
- Producers and consumers can use the “kafka snowflake topic” topic, which is formed on the Kafka cluster.
- From the Data Integrations tab on the left, click on Connectors and search for Snowflake Sink Connector (Sink simply means the consumer and source means producer).
- To connect to the Snowflake Sink, click the symbol.
- Set up the connection
- Select one or more topics.
- Assign a name to your connector.
- Choose from AVRO, JSON SR (JSON Schema), PROTOBUF, or JSON as the input message format (data from the Kafka topic) (schemaless). To use a schema-based message format (for example, Avro, JSON SR (JSON Schema), or Protobuf), a valid schema must be accessible in the Schema Registry.
- Enter the kafka Cluster credential by clicking on the “Generate Kafka API Key & Secret”. Click on Download and Continue.
- Enter the Snowflake connection details:
- Connection URL: Enter the URL to your Snowflake account’s login page. Use the format https://<account_name>.<region_id>.snowflakecomputing.com:443.
- The https:// and 443 port numbers are optional. If you’re using AWS PrivateLink and your account is in the AWS US West region, don’t use the region ID.
- Connection user name: Fill in the user name you created before. (confluent_user)
- Private key:: As a single line, type the private key you created earlier. Only the key between —BEGIN RSA PRIVATE KEY— and —END RSA PRIVATE KEY— should be entered.
- Database name: Enter the name of the database containing the table into which you want to put rows.
- Enter the name of the Snowflake Schema that includes the table into which you want to insert rows.
- Set the maximum number of tasks the connector can use due to the buffer.size.bytes property value, each job is limited to a certain amount of topic divisions. A 10 MB buffer, for example, can only hold 50 topic partitions, a 20 MB buffer can only hold 25 topic partitions, a 50 MB buffer can only hold 10 topic partitions, and a 100 MB buffer can only hold 5 topic partitions.
- Click Next
- Verify the connection details and click Launch.
- Examine the status of the connector.
The connector’s state should change from Provisioning to Running. It could take a few moments.
- Manually upload a record from a confluent kafka topic to Snowflake to check the connectivity.
- Click on Topic on the left side.
- Select the topic that you previously made.
- In the message section -> Produce a new message.
- Write a message and click on produce.
- Check Snowflake
Verify that messages are populating your Snowflake database table after the connector has been started.
If you haven’t already done so, a Snowflake table and a Snowpipe are established automatically once the topic starts producing messages.
- Automate the process by using python to generate random records and sending them to the snowflake table via a confluent kafka topic.
- In your Python Terminal, import the following libraries.
- Copy this configuration snippet into your client code to connect to your Confluent Cloud Kafka cluster. Example:-
- To produce random records, use the following code. Here I am using the ‘kafka_snowflake_topic’ topic that we created earlier, to send the data to the Snowflake Sink Connector.
- Execute the code, which will produce random data and send it to the snowflake database via a ‘kafka snowflake topic’ topic at the same time.
- A snapshots of the Snowflake sink connector during real time data transferring.
- Check Snowflake
Verify that messages are populating your Snowflake database table after the connector has been started.
- Confluent Kafka vs. Apache Kafka: Top 6 Differentiating Factors (knowledgenile.com)
- What Is Snowflake And Why You Should Use It For Your Cloud Data Warehouse – Seattle Data Guy (theseattledataguy.com)
- Dead Letter Queue | Confluent Documentation